linux/kernel/irq/affinity.c

516 lines
13 KiB
C
Raw Normal View History

License cleanup: add SPDX GPL-2.0 license identifier to files with no license Many source files in the tree are missing licensing information, which makes it harder for compliance tools to determine the correct license. By default all files without license information are under the default license of the kernel, which is GPL version 2. Update the files which contain no license information with the 'GPL-2.0' SPDX license identifier. The SPDX identifier is a legally binding shorthand, which can be used instead of the full boiler plate text. This patch is based on work done by Thomas Gleixner and Kate Stewart and Philippe Ombredanne. How this work was done: Patches were generated and checked against linux-4.14-rc6 for a subset of the use cases: - file had no licensing information it it. - file was a */uapi/* one with no licensing information in it, - file was a */uapi/* one with existing licensing information, Further patches will be generated in subsequent months to fix up cases where non-standard license headers were used, and references to license had to be inferred by heuristics based on keywords. The analysis to determine which SPDX License Identifier to be applied to a file was done in a spreadsheet of side by side results from of the output of two independent scanners (ScanCode & Windriver) producing SPDX tag:value files created by Philippe Ombredanne. Philippe prepared the base worksheet, and did an initial spot review of a few 1000 files. The 4.13 kernel was the starting point of the analysis with 60,537 files assessed. Kate Stewart did a file by file comparison of the scanner results in the spreadsheet to determine which SPDX license identifier(s) to be applied to the file. She confirmed any determination that was not immediately clear with lawyers working with the Linux Foundation. Criteria used to select files for SPDX license identifier tagging was: - Files considered eligible had to be source code files. - Make and config files were included as candidates if they contained >5 lines of source - File already had some variant of a license header in it (even if <5 lines). All documentation files were explicitly excluded. The following heuristics were used to determine which SPDX license identifiers to apply. - when both scanners couldn't find any license traces, file was considered to have no license information in it, and the top level COPYING file license applied. For non */uapi/* files that summary was: SPDX license identifier # files ---------------------------------------------------|------- GPL-2.0 11139 and resulted in the first patch in this series. If that file was a */uapi/* path one, it was "GPL-2.0 WITH Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was: SPDX license identifier # files ---------------------------------------------------|------- GPL-2.0 WITH Linux-syscall-note 930 and resulted in the second patch in this series. - if a file had some form of licensing information in it, and was one of the */uapi/* ones, it was denoted with the Linux-syscall-note if any GPL family license was found in the file or had no licensing in it (per prior point). Results summary: SPDX license identifier # files ---------------------------------------------------|------ GPL-2.0 WITH Linux-syscall-note 270 GPL-2.0+ WITH Linux-syscall-note 169 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17 LGPL-2.1+ WITH Linux-syscall-note 15 GPL-1.0+ WITH Linux-syscall-note 14 ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5 LGPL-2.0+ WITH Linux-syscall-note 4 LGPL-2.1 WITH Linux-syscall-note 3 ((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3 ((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1 and that resulted in the third patch in this series. - when the two scanners agreed on the detected license(s), that became the concluded license(s). - when there was disagreement between the two scanners (one detected a license but the other didn't, or they both detected different licenses) a manual inspection of the file occurred. - In most cases a manual inspection of the information in the file resulted in a clear resolution of the license that should apply (and which scanner probably needed to revisit its heuristics). - When it was not immediately clear, the license identifier was confirmed with lawyers working with the Linux Foundation. - If there was any question as to the appropriate license identifier, the file was flagged for further research and to be revisited later in time. In total, over 70 hours of logged manual review was done on the spreadsheet to determine the SPDX license identifiers to apply to the source files by Kate, Philippe, Thomas and, in some cases, confirmation by lawyers working with the Linux Foundation. Kate also obtained a third independent scan of the 4.13 code base from FOSSology, and compared selected files where the other two scanners disagreed against that SPDX file, to see if there was new insights. The Windriver scanner is based on an older version of FOSSology in part, so they are related. Thomas did random spot checks in about 500 files from the spreadsheets for the uapi headers and agreed with SPDX license identifier in the files he inspected. For the non-uapi files Thomas did random spot checks in about 15000 files. In initial set of patches against 4.14-rc6, 3 files were found to have copy/paste license identifier errors, and have been fixed to reflect the correct identifier. Additionally Philippe spent 10 hours this week doing a detailed manual inspection and review of the 12,461 patched files from the initial patch version early this week with: - a full scancode scan run, collecting the matched texts, detected license ids and scores - reviewing anything where there was a license detected (about 500+ files) to ensure that the applied SPDX license was correct - reviewing anything where there was no detection but the patch license was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied SPDX license was correct This produced a worksheet with 20 files needing minor correction. This worksheet was then exported into 3 different .csv files for the different types of files to be modified. These .csv files were then reviewed by Greg. Thomas wrote a script to parse the csv files and add the proper SPDX tag to the file, in the format that the file expected. This script was further refined by Greg based on the output to detect more types of files automatically and to distinguish between header and source .c files (which need different comment types.) Finally Greg ran the script using the .csv files to generate the patches. Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org> Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-01 22:07:57 +08:00
// SPDX-License-Identifier: GPL-2.0
/*
* Copyright (C) 2016 Thomas Gleixner.
* Copyright (C) 2016-2017 Christoph Hellwig.
*/
#include <linux/interrupt.h>
#include <linux/kernel.h>
#include <linux/slab.h>
#include <linux/cpu.h>
genirq/affinity: Spread vectors on node according to nr_cpu ratio Now __irq_build_affinity_masks() spreads vectors evenly per node, but there is a case that not all vectors have been spread when each numa node has a different number of CPUs which triggers the warning in the spreading code. Improve the spreading algorithm by - assigning vectors according to the ratio of the number of CPUs on a node to the number of remaining CPUs. - running the assignment from smaller nodes to bigger nodes to guarantee that every active node gets allocated at least one vector. This ensures that all vectors are spread out. Asided of that the spread becomes more fair if the nodes have different number of CPUs. For example, on the following machine: CPU(s): 16 On-line CPU(s) list: 0-15 Thread(s) per core: 1 Core(s) per socket: 8 Socket(s): 2 NUMA node(s): 2 ... NUMA node0 CPU(s): 0,1,3,5-9,11,13-15 NUMA node1 CPU(s): 2,4,10,12 When a driver requests to allocate 8 vectors, the following spread results: irq 31, cpu list 2,4 irq 32, cpu list 10,12 irq 33, cpu list 0-1 irq 34, cpu list 3,5 irq 35, cpu list 6-7 irq 36, cpu list 8-9 irq 37, cpu list 11,13 irq 38, cpu list 14-15 So Node 0 has now 6 and Node 1 has 2 vectors assigned. The original algorithm assigned 4 vectors on each node which was unfair versus Node 0. [ tglx: Massaged changelog ] Reported-by: Jon Derrick <jonathan.derrick@intel.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Jon Derrick <jonathan.derrick@intel.com> Link: https://lkml.kernel.org/r/20190816022849.14075-3-ming.lei@redhat.com
2019-08-16 10:28:49 +08:00
#include <linux/sort.h>
static void irq_spread_init_one(struct cpumask *irqmsk, struct cpumask *nmsk,
unsigned int cpus_per_vec)
{
const struct cpumask *siblmsk;
int cpu, sibl;
for ( ; cpus_per_vec > 0; ) {
cpu = cpumask_first(nmsk);
/* Should not happen, but I'm too lazy to think about it */
if (cpu >= nr_cpu_ids)
return;
cpumask_clear_cpu(cpu, nmsk);
cpumask_set_cpu(cpu, irqmsk);
cpus_per_vec--;
/* If the cpu has siblings, use them first */
siblmsk = topology_sibling_cpumask(cpu);
for (sibl = -1; cpus_per_vec > 0; ) {
sibl = cpumask_next(sibl, siblmsk);
if (sibl >= nr_cpu_ids)
break;
if (!cpumask_test_and_clear_cpu(sibl, nmsk))
continue;
cpumask_set_cpu(sibl, irqmsk);
cpus_per_vec--;
}
}
}
static cpumask_var_t *alloc_node_to_cpumask(void)
{
cpumask_var_t *masks;
int node;
masks = kcalloc(nr_node_ids, sizeof(cpumask_var_t), GFP_KERNEL);
if (!masks)
return NULL;
for (node = 0; node < nr_node_ids; node++) {
if (!zalloc_cpumask_var(&masks[node], GFP_KERNEL))
goto out_unwind;
}
return masks;
out_unwind:
while (--node >= 0)
free_cpumask_var(masks[node]);
kfree(masks);
return NULL;
}
static void free_node_to_cpumask(cpumask_var_t *masks)
{
int node;
for (node = 0; node < nr_node_ids; node++)
free_cpumask_var(masks[node]);
kfree(masks);
}
static void build_node_to_cpumask(cpumask_var_t *masks)
{
int cpu;
for_each_possible_cpu(cpu)
cpumask_set_cpu(cpu, masks[cpu_to_node(cpu)]);
}
static int get_nodes_in_cpumask(cpumask_var_t *node_to_cpumask,
const struct cpumask *mask, nodemask_t *nodemsk)
{
genirq/affinity: Fix node generation from cpumask Commit 34c3d9819fda ("genirq/affinity: Provide smarter irq spreading infrastructure") introduced a better IRQ spreading mechanism, taking account of the available NUMA nodes in the machine. Problem is that the algorithm of retrieving the nodemask iterates "linearly" based on the number of online nodes - some architectures present non-linear node distribution among the nodemask, like PowerPC. If this is the case, the algorithm lead to a wrong node count number and therefore to a bad/incomplete IRQ affinity distribution. For example, this problem were found in a machine with 128 CPUs and two nodes, namely nodes 0 and 8 (instead of 0 and 1, if it was linearly distributed). This led to a wrong affinity distribution which then led to a bad mq allocation for nvme driver. Finally, we take the opportunity to fix a comment regarding the affinity distribution when we have _more_ nodes than vectors. Fixes: 34c3d9819fda ("genirq/affinity: Provide smarter irq spreading infrastructure") Reported-by: Gabriel Krisman Bertazi <gabriel@krisman.be> Signed-off-by: Guilherme G. Piccoli <gpiccoli@linux.vnet.ibm.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Gabriel Krisman Bertazi <gabriel@krisman.be> Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Cc: linux-pci@vger.kernel.org Cc: linuxppc-dev@lists.ozlabs.org Cc: hch@lst.de Link: http://lkml.kernel.org/r/1481738472-2671-1-git-send-email-gpiccoli@linux.vnet.ibm.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-12-15 02:01:12 +08:00
int n, nodes = 0;
/* Calculate the number of nodes in the supplied affinity mask */
for_each_node(n) {
if (cpumask_intersects(mask, node_to_cpumask[n])) {
node_set(n, *nodemsk);
nodes++;
}
}
return nodes;
}
genirq/affinity: Spread vectors on node according to nr_cpu ratio Now __irq_build_affinity_masks() spreads vectors evenly per node, but there is a case that not all vectors have been spread when each numa node has a different number of CPUs which triggers the warning in the spreading code. Improve the spreading algorithm by - assigning vectors according to the ratio of the number of CPUs on a node to the number of remaining CPUs. - running the assignment from smaller nodes to bigger nodes to guarantee that every active node gets allocated at least one vector. This ensures that all vectors are spread out. Asided of that the spread becomes more fair if the nodes have different number of CPUs. For example, on the following machine: CPU(s): 16 On-line CPU(s) list: 0-15 Thread(s) per core: 1 Core(s) per socket: 8 Socket(s): 2 NUMA node(s): 2 ... NUMA node0 CPU(s): 0,1,3,5-9,11,13-15 NUMA node1 CPU(s): 2,4,10,12 When a driver requests to allocate 8 vectors, the following spread results: irq 31, cpu list 2,4 irq 32, cpu list 10,12 irq 33, cpu list 0-1 irq 34, cpu list 3,5 irq 35, cpu list 6-7 irq 36, cpu list 8-9 irq 37, cpu list 11,13 irq 38, cpu list 14-15 So Node 0 has now 6 and Node 1 has 2 vectors assigned. The original algorithm assigned 4 vectors on each node which was unfair versus Node 0. [ tglx: Massaged changelog ] Reported-by: Jon Derrick <jonathan.derrick@intel.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Jon Derrick <jonathan.derrick@intel.com> Link: https://lkml.kernel.org/r/20190816022849.14075-3-ming.lei@redhat.com
2019-08-16 10:28:49 +08:00
struct node_vectors {
unsigned id;
union {
unsigned nvectors;
unsigned ncpus;
};
};
static int ncpus_cmp_func(const void *l, const void *r)
{
const struct node_vectors *ln = l;
const struct node_vectors *rn = r;
return ln->ncpus - rn->ncpus;
}
/*
* Allocate vector number for each node, so that for each node:
*
* 1) the allocated number is >= 1
*
* 2) the allocated numbver is <= active CPU number of this node
*
* The actual allocated total vectors may be less than @numvecs when
* active total CPU number is less than @numvecs.
*
* Active CPUs means the CPUs in '@cpu_mask AND @node_to_cpumask[]'
* for each node.
*/
static void alloc_nodes_vectors(unsigned int numvecs,
cpumask_var_t *node_to_cpumask,
genirq/affinity: Spread vectors on node according to nr_cpu ratio Now __irq_build_affinity_masks() spreads vectors evenly per node, but there is a case that not all vectors have been spread when each numa node has a different number of CPUs which triggers the warning in the spreading code. Improve the spreading algorithm by - assigning vectors according to the ratio of the number of CPUs on a node to the number of remaining CPUs. - running the assignment from smaller nodes to bigger nodes to guarantee that every active node gets allocated at least one vector. This ensures that all vectors are spread out. Asided of that the spread becomes more fair if the nodes have different number of CPUs. For example, on the following machine: CPU(s): 16 On-line CPU(s) list: 0-15 Thread(s) per core: 1 Core(s) per socket: 8 Socket(s): 2 NUMA node(s): 2 ... NUMA node0 CPU(s): 0,1,3,5-9,11,13-15 NUMA node1 CPU(s): 2,4,10,12 When a driver requests to allocate 8 vectors, the following spread results: irq 31, cpu list 2,4 irq 32, cpu list 10,12 irq 33, cpu list 0-1 irq 34, cpu list 3,5 irq 35, cpu list 6-7 irq 36, cpu list 8-9 irq 37, cpu list 11,13 irq 38, cpu list 14-15 So Node 0 has now 6 and Node 1 has 2 vectors assigned. The original algorithm assigned 4 vectors on each node which was unfair versus Node 0. [ tglx: Massaged changelog ] Reported-by: Jon Derrick <jonathan.derrick@intel.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Jon Derrick <jonathan.derrick@intel.com> Link: https://lkml.kernel.org/r/20190816022849.14075-3-ming.lei@redhat.com
2019-08-16 10:28:49 +08:00
const struct cpumask *cpu_mask,
const nodemask_t nodemsk,
struct cpumask *nmsk,
struct node_vectors *node_vectors)
{
unsigned n, remaining_ncpus = 0;
for (n = 0; n < nr_node_ids; n++) {
node_vectors[n].id = n;
node_vectors[n].ncpus = UINT_MAX;
}
for_each_node_mask(n, nodemsk) {
unsigned ncpus;
cpumask_and(nmsk, cpu_mask, node_to_cpumask[n]);
ncpus = cpumask_weight(nmsk);
if (!ncpus)
continue;
remaining_ncpus += ncpus;
node_vectors[n].ncpus = ncpus;
}
numvecs = min_t(unsigned, remaining_ncpus, numvecs);
sort(node_vectors, nr_node_ids, sizeof(node_vectors[0]),
ncpus_cmp_func, NULL);
/*
* Allocate vectors for each node according to the ratio of this
* node's nr_cpus to remaining un-assigned ncpus. 'numvecs' is
* bigger than number of active numa nodes. Always start the
* allocation from the node with minimized nr_cpus.
*
* This way guarantees that each active node gets allocated at
* least one vector, and the theory is simple: over-allocation
* is only done when this node is assigned by one vector, so
* other nodes will be allocated >= 1 vector, since 'numvecs' is
* bigger than number of numa nodes.
*
* One perfect invariant is that number of allocated vectors for
* each node is <= CPU count of this node:
*
* 1) suppose there are two nodes: A and B
* ncpu(X) is CPU count of node X
* vecs(X) is the vector count allocated to node X via this
* algorithm
*
* ncpu(A) <= ncpu(B)
* ncpu(A) + ncpu(B) = N
* vecs(A) + vecs(B) = V
*
* vecs(A) = max(1, round_down(V * ncpu(A) / N))
* vecs(B) = V - vecs(A)
*
* both N and V are integer, and 2 <= V <= N, suppose
* V = N - delta, and 0 <= delta <= N - 2
*
* 2) obviously vecs(A) <= ncpu(A) because:
*
* if vecs(A) is 1, then vecs(A) <= ncpu(A) given
* ncpu(A) >= 1
*
* otherwise,
* vecs(A) <= V * ncpu(A) / N <= ncpu(A), given V <= N
*
* 3) prove how vecs(B) <= ncpu(B):
*
* if round_down(V * ncpu(A) / N) == 0, vecs(B) won't be
* over-allocated, so vecs(B) <= ncpu(B),
*
* otherwise:
*
* vecs(A) =
* round_down(V * ncpu(A) / N) =
* round_down((N - delta) * ncpu(A) / N) =
* round_down((N * ncpu(A) - delta * ncpu(A)) / N) >=
* round_down((N * ncpu(A) - delta * N) / N) =
* cpu(A) - delta
*
* then:
*
* vecs(A) - V >= ncpu(A) - delta - V
* =>
* V - vecs(A) <= V + delta - ncpu(A)
* =>
* vecs(B) <= N - ncpu(A)
* =>
* vecs(B) <= cpu(B)
*
* For nodes >= 3, it can be thought as one node and another big
* node given that is exactly what this algorithm is implemented,
* and we always re-calculate 'remaining_ncpus' & 'numvecs', and
* finally for each node X: vecs(X) <= ncpu(X).
*
*/
for (n = 0; n < nr_node_ids; n++) {
unsigned nvectors, ncpus;
if (node_vectors[n].ncpus == UINT_MAX)
continue;
WARN_ON_ONCE(numvecs == 0);
ncpus = node_vectors[n].ncpus;
nvectors = max_t(unsigned, 1,
numvecs * ncpus / remaining_ncpus);
WARN_ON_ONCE(nvectors > ncpus);
node_vectors[n].nvectors = nvectors;
remaining_ncpus -= ncpus;
numvecs -= nvectors;
}
}
static int __irq_build_affinity_masks(unsigned int startvec,
unsigned int numvecs,
unsigned int firstvec,
cpumask_var_t *node_to_cpumask,
const struct cpumask *cpu_mask,
struct cpumask *nmsk,
genirq/core: Introduce struct irq_affinity_desc The interrupt affinity management uses straight cpumask pointers to convey the automatically assigned affinity masks for managed interrupts. The core interrupt descriptor allocation also decides based on the pointer being non NULL whether an interrupt is managed or not. Devices which use managed interrupts usually have two classes of interrupts: - Interrupts for multiple device queues - Interrupts for general device management Currently both classes are treated the same way, i.e. as managed interrupts. The general interrupts get the default affinity mask assigned while the device queue interrupts are spread out over the possible CPUs. Treating the general interrupts as managed is both a limitation and under certain circumstances a bug. Assume the following situation: default_irq_affinity = 4..7 So if CPUs 4-7 are offlined, then the core code will shut down the device management interrupts because the last CPU in their affinity mask went offline. It's also a limitation because it's desired to allow manual placement of the general device interrupts for various reasons. If they are marked managed then the interrupt affinity setting from both user and kernel space is disabled. To remedy that situation it's required to convey more information than the cpumasks through various interfaces related to interrupt descriptor allocation. Instead of adding yet another argument, create a new data structure 'irq_affinity_desc' which for now just contains the cpumask. This struct can be expanded to convey auxilliary information in the next step. No functional change, just preparatory work. [ tglx: Simplified logic and clarified changelog ] Suggested-by: Thomas Gleixner <tglx@linutronix.de> Suggested-by: Bjorn Helgaas <bhelgaas@google.com> Signed-off-by: Dou Liyang <douliyangs@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: linux-pci@vger.kernel.org Cc: kashyap.desai@broadcom.com Cc: shivasharan.srikanteshwara@broadcom.com Cc: sumit.saxena@broadcom.com Cc: ming.lei@redhat.com Cc: hch@lst.de Cc: douliyang1@huawei.com Link: https://lkml.kernel.org/r/20181204155122.6327-2-douliyangs@gmail.com
2018-12-04 23:51:20 +08:00
struct irq_affinity_desc *masks)
{
genirq/affinity: Spread vectors on node according to nr_cpu ratio Now __irq_build_affinity_masks() spreads vectors evenly per node, but there is a case that not all vectors have been spread when each numa node has a different number of CPUs which triggers the warning in the spreading code. Improve the spreading algorithm by - assigning vectors according to the ratio of the number of CPUs on a node to the number of remaining CPUs. - running the assignment from smaller nodes to bigger nodes to guarantee that every active node gets allocated at least one vector. This ensures that all vectors are spread out. Asided of that the spread becomes more fair if the nodes have different number of CPUs. For example, on the following machine: CPU(s): 16 On-line CPU(s) list: 0-15 Thread(s) per core: 1 Core(s) per socket: 8 Socket(s): 2 NUMA node(s): 2 ... NUMA node0 CPU(s): 0,1,3,5-9,11,13-15 NUMA node1 CPU(s): 2,4,10,12 When a driver requests to allocate 8 vectors, the following spread results: irq 31, cpu list 2,4 irq 32, cpu list 10,12 irq 33, cpu list 0-1 irq 34, cpu list 3,5 irq 35, cpu list 6-7 irq 36, cpu list 8-9 irq 37, cpu list 11,13 irq 38, cpu list 14-15 So Node 0 has now 6 and Node 1 has 2 vectors assigned. The original algorithm assigned 4 vectors on each node which was unfair versus Node 0. [ tglx: Massaged changelog ] Reported-by: Jon Derrick <jonathan.derrick@intel.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Jon Derrick <jonathan.derrick@intel.com> Link: https://lkml.kernel.org/r/20190816022849.14075-3-ming.lei@redhat.com
2019-08-16 10:28:49 +08:00
unsigned int i, n, nodes, cpus_per_vec, extra_vecs, done = 0;
unsigned int last_affv = firstvec + numvecs;
unsigned int curvec = startvec;
nodemask_t nodemsk = NODE_MASK_NONE;
genirq/affinity: Spread vectors on node according to nr_cpu ratio Now __irq_build_affinity_masks() spreads vectors evenly per node, but there is a case that not all vectors have been spread when each numa node has a different number of CPUs which triggers the warning in the spreading code. Improve the spreading algorithm by - assigning vectors according to the ratio of the number of CPUs on a node to the number of remaining CPUs. - running the assignment from smaller nodes to bigger nodes to guarantee that every active node gets allocated at least one vector. This ensures that all vectors are spread out. Asided of that the spread becomes more fair if the nodes have different number of CPUs. For example, on the following machine: CPU(s): 16 On-line CPU(s) list: 0-15 Thread(s) per core: 1 Core(s) per socket: 8 Socket(s): 2 NUMA node(s): 2 ... NUMA node0 CPU(s): 0,1,3,5-9,11,13-15 NUMA node1 CPU(s): 2,4,10,12 When a driver requests to allocate 8 vectors, the following spread results: irq 31, cpu list 2,4 irq 32, cpu list 10,12 irq 33, cpu list 0-1 irq 34, cpu list 3,5 irq 35, cpu list 6-7 irq 36, cpu list 8-9 irq 37, cpu list 11,13 irq 38, cpu list 14-15 So Node 0 has now 6 and Node 1 has 2 vectors assigned. The original algorithm assigned 4 vectors on each node which was unfair versus Node 0. [ tglx: Massaged changelog ] Reported-by: Jon Derrick <jonathan.derrick@intel.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Jon Derrick <jonathan.derrick@intel.com> Link: https://lkml.kernel.org/r/20190816022849.14075-3-ming.lei@redhat.com
2019-08-16 10:28:49 +08:00
struct node_vectors *node_vectors;
if (cpumask_empty(cpu_mask))
genirq/affinity: Spread irq vectors among present CPUs as far as possible Commit 84676c1f21 ("genirq/affinity: assign vectors to all possible CPUs") tried to spread the interrupts accross all possible CPUs to make sure that in case of phsyical hotplug (e.g. virtualization) the CPUs which get plugged in after the device was initialized are targeted by a hardware queue and the corresponding interrupt. This has a downside in cases where the ACPI tables claim that there are more possible CPUs than present CPUs and the number of interrupts to spread out is smaller than the number of possible CPUs. These bogus ACPI tables are unfortunately not uncommon. In such a case the vector spreading algorithm assigns interrupts to CPUs which can never be utilized and as a consequence these interrupts are unused instead of being mapped to present CPUs. As a result the performance of the device is suboptimal. To fix this spread the interrupt vectors in two stages: 1) Spread as many interrupts as possible among the present CPUs 2) Spread the remaining vectors among non present CPUs On a 8 core system, where CPU 0-3 are present and CPU 4-7 are not present, for a device with 4 queues the resulting interrupt affinity is: 1) Before 84676c1f21 ("genirq/affinity: assign vectors to all possible CPUs") irq 39, cpu list 0 irq 40, cpu list 1 irq 41, cpu list 2 irq 42, cpu list 3 2) With 84676c1f21 ("genirq/affinity: assign vectors to all possible CPUs") irq 39, cpu list 0-2 irq 40, cpu list 3-4,6 irq 41, cpu list 5 irq 42, cpu list 7 3) With the refined vector spread applied: irq 39, cpu list 0,4 irq 40, cpu list 1,6 irq 41, cpu list 2,5 irq 42, cpu list 3,7 On a 8 core system, where all CPUs are present the resulting interrupt affinity for the 4 queues is: irq 39, cpu list 0,1 irq 40, cpu list 2,3 irq 41, cpu list 4,5 irq 42, cpu list 6,7 This is independent of the number of CPUs which are online at the point of initialization because in such a system the offline CPUs can be easily onlined afterwards, while in non-present CPUs need to be plugged physically or virtually which requires external interaction. The downside of this approach is that in case of physical hotplug the interrupt vector spreading might be suboptimal when CPUs 4-7 are physically plugged. Suboptimal from a NUMA point of view and due to the single target nature of interrupt affinities the later plugged CPUs might not be targeted by interrupts at all. Though, physical hotplug systems are not the common case while the broken ACPI table disease is wide spread. So it's preferred to have as many interrupts as possible utilized at the point where the device is initialized. Block multi-queue devices like NVME create a hardware queue per possible CPU, so the goal of commit 84676c1f21 to assign one interrupt vector per possible CPU is still achieved even with physical/virtual hotplug. [ tglx: Changed from online to present CPUs for the first spreading stage, renamed variables for readability sake, added comments and massaged changelog ] Reported-by: Laurence Oberman <loberman@redhat.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Jens Axboe <axboe@kernel.dk> Cc: linux-block@vger.kernel.org Cc: Christoph Hellwig <hch@infradead.org> Link: https://lkml.kernel.org/r/20180308105358.1506-5-ming.lei@redhat.com
2018-03-08 18:53:58 +08:00
return 0;
nodes = get_nodes_in_cpumask(node_to_cpumask, cpu_mask, &nodemsk);
/*
genirq/affinity: Fix node generation from cpumask Commit 34c3d9819fda ("genirq/affinity: Provide smarter irq spreading infrastructure") introduced a better IRQ spreading mechanism, taking account of the available NUMA nodes in the machine. Problem is that the algorithm of retrieving the nodemask iterates "linearly" based on the number of online nodes - some architectures present non-linear node distribution among the nodemask, like PowerPC. If this is the case, the algorithm lead to a wrong node count number and therefore to a bad/incomplete IRQ affinity distribution. For example, this problem were found in a machine with 128 CPUs and two nodes, namely nodes 0 and 8 (instead of 0 and 1, if it was linearly distributed). This led to a wrong affinity distribution which then led to a bad mq allocation for nvme driver. Finally, we take the opportunity to fix a comment regarding the affinity distribution when we have _more_ nodes than vectors. Fixes: 34c3d9819fda ("genirq/affinity: Provide smarter irq spreading infrastructure") Reported-by: Gabriel Krisman Bertazi <gabriel@krisman.be> Signed-off-by: Guilherme G. Piccoli <gpiccoli@linux.vnet.ibm.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Gabriel Krisman Bertazi <gabriel@krisman.be> Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Cc: linux-pci@vger.kernel.org Cc: linuxppc-dev@lists.ozlabs.org Cc: hch@lst.de Link: http://lkml.kernel.org/r/1481738472-2671-1-git-send-email-gpiccoli@linux.vnet.ibm.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-12-15 02:01:12 +08:00
* If the number of nodes in the mask is greater than or equal the
* number of vectors we just spread the vectors across the nodes.
*/
if (numvecs <= nodes) {
for_each_node_mask(n, nodemsk) {
/* Ensure that only CPUs which are in both masks are set */
cpumask_and(nmsk, cpu_mask, node_to_cpumask[n]);
cpumask_or(&masks[curvec].mask, &masks[curvec].mask, nmsk);
if (++curvec == last_affv)
curvec = firstvec;
}
return numvecs;
}
genirq/affinity: Spread vectors on node according to nr_cpu ratio Now __irq_build_affinity_masks() spreads vectors evenly per node, but there is a case that not all vectors have been spread when each numa node has a different number of CPUs which triggers the warning in the spreading code. Improve the spreading algorithm by - assigning vectors according to the ratio of the number of CPUs on a node to the number of remaining CPUs. - running the assignment from smaller nodes to bigger nodes to guarantee that every active node gets allocated at least one vector. This ensures that all vectors are spread out. Asided of that the spread becomes more fair if the nodes have different number of CPUs. For example, on the following machine: CPU(s): 16 On-line CPU(s) list: 0-15 Thread(s) per core: 1 Core(s) per socket: 8 Socket(s): 2 NUMA node(s): 2 ... NUMA node0 CPU(s): 0,1,3,5-9,11,13-15 NUMA node1 CPU(s): 2,4,10,12 When a driver requests to allocate 8 vectors, the following spread results: irq 31, cpu list 2,4 irq 32, cpu list 10,12 irq 33, cpu list 0-1 irq 34, cpu list 3,5 irq 35, cpu list 6-7 irq 36, cpu list 8-9 irq 37, cpu list 11,13 irq 38, cpu list 14-15 So Node 0 has now 6 and Node 1 has 2 vectors assigned. The original algorithm assigned 4 vectors on each node which was unfair versus Node 0. [ tglx: Massaged changelog ] Reported-by: Jon Derrick <jonathan.derrick@intel.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Jon Derrick <jonathan.derrick@intel.com> Link: https://lkml.kernel.org/r/20190816022849.14075-3-ming.lei@redhat.com
2019-08-16 10:28:49 +08:00
node_vectors = kcalloc(nr_node_ids,
sizeof(struct node_vectors),
GFP_KERNEL);
if (!node_vectors)
return -ENOMEM;
/* allocate vector number for each node */
alloc_nodes_vectors(numvecs, node_to_cpumask, cpu_mask,
nodemsk, nmsk, node_vectors);
for (i = 0; i < nr_node_ids; i++) {
unsigned int ncpus, v;
struct node_vectors *nv = &node_vectors[i];
if (nv->nvectors == UINT_MAX)
continue;
/* Get the cpus on this node which are in the mask */
genirq/affinity: Spread vectors on node according to nr_cpu ratio Now __irq_build_affinity_masks() spreads vectors evenly per node, but there is a case that not all vectors have been spread when each numa node has a different number of CPUs which triggers the warning in the spreading code. Improve the spreading algorithm by - assigning vectors according to the ratio of the number of CPUs on a node to the number of remaining CPUs. - running the assignment from smaller nodes to bigger nodes to guarantee that every active node gets allocated at least one vector. This ensures that all vectors are spread out. Asided of that the spread becomes more fair if the nodes have different number of CPUs. For example, on the following machine: CPU(s): 16 On-line CPU(s) list: 0-15 Thread(s) per core: 1 Core(s) per socket: 8 Socket(s): 2 NUMA node(s): 2 ... NUMA node0 CPU(s): 0,1,3,5-9,11,13-15 NUMA node1 CPU(s): 2,4,10,12 When a driver requests to allocate 8 vectors, the following spread results: irq 31, cpu list 2,4 irq 32, cpu list 10,12 irq 33, cpu list 0-1 irq 34, cpu list 3,5 irq 35, cpu list 6-7 irq 36, cpu list 8-9 irq 37, cpu list 11,13 irq 38, cpu list 14-15 So Node 0 has now 6 and Node 1 has 2 vectors assigned. The original algorithm assigned 4 vectors on each node which was unfair versus Node 0. [ tglx: Massaged changelog ] Reported-by: Jon Derrick <jonathan.derrick@intel.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Jon Derrick <jonathan.derrick@intel.com> Link: https://lkml.kernel.org/r/20190816022849.14075-3-ming.lei@redhat.com
2019-08-16 10:28:49 +08:00
cpumask_and(nmsk, cpu_mask, node_to_cpumask[nv->id]);
ncpus = cpumask_weight(nmsk);
if (!ncpus)
continue;
genirq/affinity: Spread vectors on node according to nr_cpu ratio Now __irq_build_affinity_masks() spreads vectors evenly per node, but there is a case that not all vectors have been spread when each numa node has a different number of CPUs which triggers the warning in the spreading code. Improve the spreading algorithm by - assigning vectors according to the ratio of the number of CPUs on a node to the number of remaining CPUs. - running the assignment from smaller nodes to bigger nodes to guarantee that every active node gets allocated at least one vector. This ensures that all vectors are spread out. Asided of that the spread becomes more fair if the nodes have different number of CPUs. For example, on the following machine: CPU(s): 16 On-line CPU(s) list: 0-15 Thread(s) per core: 1 Core(s) per socket: 8 Socket(s): 2 NUMA node(s): 2 ... NUMA node0 CPU(s): 0,1,3,5-9,11,13-15 NUMA node1 CPU(s): 2,4,10,12 When a driver requests to allocate 8 vectors, the following spread results: irq 31, cpu list 2,4 irq 32, cpu list 10,12 irq 33, cpu list 0-1 irq 34, cpu list 3,5 irq 35, cpu list 6-7 irq 36, cpu list 8-9 irq 37, cpu list 11,13 irq 38, cpu list 14-15 So Node 0 has now 6 and Node 1 has 2 vectors assigned. The original algorithm assigned 4 vectors on each node which was unfair versus Node 0. [ tglx: Massaged changelog ] Reported-by: Jon Derrick <jonathan.derrick@intel.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Jon Derrick <jonathan.derrick@intel.com> Link: https://lkml.kernel.org/r/20190816022849.14075-3-ming.lei@redhat.com
2019-08-16 10:28:49 +08:00
WARN_ON_ONCE(nv->nvectors > ncpus);
/* Account for rounding errors */
genirq/affinity: Spread vectors on node according to nr_cpu ratio Now __irq_build_affinity_masks() spreads vectors evenly per node, but there is a case that not all vectors have been spread when each numa node has a different number of CPUs which triggers the warning in the spreading code. Improve the spreading algorithm by - assigning vectors according to the ratio of the number of CPUs on a node to the number of remaining CPUs. - running the assignment from smaller nodes to bigger nodes to guarantee that every active node gets allocated at least one vector. This ensures that all vectors are spread out. Asided of that the spread becomes more fair if the nodes have different number of CPUs. For example, on the following machine: CPU(s): 16 On-line CPU(s) list: 0-15 Thread(s) per core: 1 Core(s) per socket: 8 Socket(s): 2 NUMA node(s): 2 ... NUMA node0 CPU(s): 0,1,3,5-9,11,13-15 NUMA node1 CPU(s): 2,4,10,12 When a driver requests to allocate 8 vectors, the following spread results: irq 31, cpu list 2,4 irq 32, cpu list 10,12 irq 33, cpu list 0-1 irq 34, cpu list 3,5 irq 35, cpu list 6-7 irq 36, cpu list 8-9 irq 37, cpu list 11,13 irq 38, cpu list 14-15 So Node 0 has now 6 and Node 1 has 2 vectors assigned. The original algorithm assigned 4 vectors on each node which was unfair versus Node 0. [ tglx: Massaged changelog ] Reported-by: Jon Derrick <jonathan.derrick@intel.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Jon Derrick <jonathan.derrick@intel.com> Link: https://lkml.kernel.org/r/20190816022849.14075-3-ming.lei@redhat.com
2019-08-16 10:28:49 +08:00
extra_vecs = ncpus - nv->nvectors * (ncpus / nv->nvectors);
genirq/affinity: Spread vectors on node according to nr_cpu ratio Now __irq_build_affinity_masks() spreads vectors evenly per node, but there is a case that not all vectors have been spread when each numa node has a different number of CPUs which triggers the warning in the spreading code. Improve the spreading algorithm by - assigning vectors according to the ratio of the number of CPUs on a node to the number of remaining CPUs. - running the assignment from smaller nodes to bigger nodes to guarantee that every active node gets allocated at least one vector. This ensures that all vectors are spread out. Asided of that the spread becomes more fair if the nodes have different number of CPUs. For example, on the following machine: CPU(s): 16 On-line CPU(s) list: 0-15 Thread(s) per core: 1 Core(s) per socket: 8 Socket(s): 2 NUMA node(s): 2 ... NUMA node0 CPU(s): 0,1,3,5-9,11,13-15 NUMA node1 CPU(s): 2,4,10,12 When a driver requests to allocate 8 vectors, the following spread results: irq 31, cpu list 2,4 irq 32, cpu list 10,12 irq 33, cpu list 0-1 irq 34, cpu list 3,5 irq 35, cpu list 6-7 irq 36, cpu list 8-9 irq 37, cpu list 11,13 irq 38, cpu list 14-15 So Node 0 has now 6 and Node 1 has 2 vectors assigned. The original algorithm assigned 4 vectors on each node which was unfair versus Node 0. [ tglx: Massaged changelog ] Reported-by: Jon Derrick <jonathan.derrick@intel.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Jon Derrick <jonathan.derrick@intel.com> Link: https://lkml.kernel.org/r/20190816022849.14075-3-ming.lei@redhat.com
2019-08-16 10:28:49 +08:00
/* Spread allocated vectors on CPUs of the current node */
for (v = 0; v < nv->nvectors; v++, curvec++) {
cpus_per_vec = ncpus / nv->nvectors;
/* Account for extra vectors to compensate rounding errors */
if (extra_vecs) {
cpus_per_vec++;
--extra_vecs;
}
genirq/affinity: Spread vectors on node according to nr_cpu ratio Now __irq_build_affinity_masks() spreads vectors evenly per node, but there is a case that not all vectors have been spread when each numa node has a different number of CPUs which triggers the warning in the spreading code. Improve the spreading algorithm by - assigning vectors according to the ratio of the number of CPUs on a node to the number of remaining CPUs. - running the assignment from smaller nodes to bigger nodes to guarantee that every active node gets allocated at least one vector. This ensures that all vectors are spread out. Asided of that the spread becomes more fair if the nodes have different number of CPUs. For example, on the following machine: CPU(s): 16 On-line CPU(s) list: 0-15 Thread(s) per core: 1 Core(s) per socket: 8 Socket(s): 2 NUMA node(s): 2 ... NUMA node0 CPU(s): 0,1,3,5-9,11,13-15 NUMA node1 CPU(s): 2,4,10,12 When a driver requests to allocate 8 vectors, the following spread results: irq 31, cpu list 2,4 irq 32, cpu list 10,12 irq 33, cpu list 0-1 irq 34, cpu list 3,5 irq 35, cpu list 6-7 irq 36, cpu list 8-9 irq 37, cpu list 11,13 irq 38, cpu list 14-15 So Node 0 has now 6 and Node 1 has 2 vectors assigned. The original algorithm assigned 4 vectors on each node which was unfair versus Node 0. [ tglx: Massaged changelog ] Reported-by: Jon Derrick <jonathan.derrick@intel.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Jon Derrick <jonathan.derrick@intel.com> Link: https://lkml.kernel.org/r/20190816022849.14075-3-ming.lei@redhat.com
2019-08-16 10:28:49 +08:00
/*
* wrapping has to be considered given 'startvec'
* may start anywhere
*/
if (curvec >= last_affv)
curvec = firstvec;
genirq/core: Introduce struct irq_affinity_desc The interrupt affinity management uses straight cpumask pointers to convey the automatically assigned affinity masks for managed interrupts. The core interrupt descriptor allocation also decides based on the pointer being non NULL whether an interrupt is managed or not. Devices which use managed interrupts usually have two classes of interrupts: - Interrupts for multiple device queues - Interrupts for general device management Currently both classes are treated the same way, i.e. as managed interrupts. The general interrupts get the default affinity mask assigned while the device queue interrupts are spread out over the possible CPUs. Treating the general interrupts as managed is both a limitation and under certain circumstances a bug. Assume the following situation: default_irq_affinity = 4..7 So if CPUs 4-7 are offlined, then the core code will shut down the device management interrupts because the last CPU in their affinity mask went offline. It's also a limitation because it's desired to allow manual placement of the general device interrupts for various reasons. If they are marked managed then the interrupt affinity setting from both user and kernel space is disabled. To remedy that situation it's required to convey more information than the cpumasks through various interfaces related to interrupt descriptor allocation. Instead of adding yet another argument, create a new data structure 'irq_affinity_desc' which for now just contains the cpumask. This struct can be expanded to convey auxilliary information in the next step. No functional change, just preparatory work. [ tglx: Simplified logic and clarified changelog ] Suggested-by: Thomas Gleixner <tglx@linutronix.de> Suggested-by: Bjorn Helgaas <bhelgaas@google.com> Signed-off-by: Dou Liyang <douliyangs@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: linux-pci@vger.kernel.org Cc: kashyap.desai@broadcom.com Cc: shivasharan.srikanteshwara@broadcom.com Cc: sumit.saxena@broadcom.com Cc: ming.lei@redhat.com Cc: hch@lst.de Cc: douliyang1@huawei.com Link: https://lkml.kernel.org/r/20181204155122.6327-2-douliyangs@gmail.com
2018-12-04 23:51:20 +08:00
irq_spread_init_one(&masks[curvec].mask, nmsk,
cpus_per_vec);
}
genirq/affinity: Spread vectors on node according to nr_cpu ratio Now __irq_build_affinity_masks() spreads vectors evenly per node, but there is a case that not all vectors have been spread when each numa node has a different number of CPUs which triggers the warning in the spreading code. Improve the spreading algorithm by - assigning vectors according to the ratio of the number of CPUs on a node to the number of remaining CPUs. - running the assignment from smaller nodes to bigger nodes to guarantee that every active node gets allocated at least one vector. This ensures that all vectors are spread out. Asided of that the spread becomes more fair if the nodes have different number of CPUs. For example, on the following machine: CPU(s): 16 On-line CPU(s) list: 0-15 Thread(s) per core: 1 Core(s) per socket: 8 Socket(s): 2 NUMA node(s): 2 ... NUMA node0 CPU(s): 0,1,3,5-9,11,13-15 NUMA node1 CPU(s): 2,4,10,12 When a driver requests to allocate 8 vectors, the following spread results: irq 31, cpu list 2,4 irq 32, cpu list 10,12 irq 33, cpu list 0-1 irq 34, cpu list 3,5 irq 35, cpu list 6-7 irq 36, cpu list 8-9 irq 37, cpu list 11,13 irq 38, cpu list 14-15 So Node 0 has now 6 and Node 1 has 2 vectors assigned. The original algorithm assigned 4 vectors on each node which was unfair versus Node 0. [ tglx: Massaged changelog ] Reported-by: Jon Derrick <jonathan.derrick@intel.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Jon Derrick <jonathan.derrick@intel.com> Link: https://lkml.kernel.org/r/20190816022849.14075-3-ming.lei@redhat.com
2019-08-16 10:28:49 +08:00
done += nv->nvectors;
}
genirq/affinity: Spread vectors on node according to nr_cpu ratio Now __irq_build_affinity_masks() spreads vectors evenly per node, but there is a case that not all vectors have been spread when each numa node has a different number of CPUs which triggers the warning in the spreading code. Improve the spreading algorithm by - assigning vectors according to the ratio of the number of CPUs on a node to the number of remaining CPUs. - running the assignment from smaller nodes to bigger nodes to guarantee that every active node gets allocated at least one vector. This ensures that all vectors are spread out. Asided of that the spread becomes more fair if the nodes have different number of CPUs. For example, on the following machine: CPU(s): 16 On-line CPU(s) list: 0-15 Thread(s) per core: 1 Core(s) per socket: 8 Socket(s): 2 NUMA node(s): 2 ... NUMA node0 CPU(s): 0,1,3,5-9,11,13-15 NUMA node1 CPU(s): 2,4,10,12 When a driver requests to allocate 8 vectors, the following spread results: irq 31, cpu list 2,4 irq 32, cpu list 10,12 irq 33, cpu list 0-1 irq 34, cpu list 3,5 irq 35, cpu list 6-7 irq 36, cpu list 8-9 irq 37, cpu list 11,13 irq 38, cpu list 14-15 So Node 0 has now 6 and Node 1 has 2 vectors assigned. The original algorithm assigned 4 vectors on each node which was unfair versus Node 0. [ tglx: Massaged changelog ] Reported-by: Jon Derrick <jonathan.derrick@intel.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Jon Derrick <jonathan.derrick@intel.com> Link: https://lkml.kernel.org/r/20190816022849.14075-3-ming.lei@redhat.com
2019-08-16 10:28:49 +08:00
kfree(node_vectors);
return done;
}
/*
* build affinity in two stages:
* 1) spread present CPU on these vectors
* 2) spread other possible CPUs on these vectors
*/
static int irq_build_affinity_masks(unsigned int startvec, unsigned int numvecs,
unsigned int firstvec,
genirq/core: Introduce struct irq_affinity_desc The interrupt affinity management uses straight cpumask pointers to convey the automatically assigned affinity masks for managed interrupts. The core interrupt descriptor allocation also decides based on the pointer being non NULL whether an interrupt is managed or not. Devices which use managed interrupts usually have two classes of interrupts: - Interrupts for multiple device queues - Interrupts for general device management Currently both classes are treated the same way, i.e. as managed interrupts. The general interrupts get the default affinity mask assigned while the device queue interrupts are spread out over the possible CPUs. Treating the general interrupts as managed is both a limitation and under certain circumstances a bug. Assume the following situation: default_irq_affinity = 4..7 So if CPUs 4-7 are offlined, then the core code will shut down the device management interrupts because the last CPU in their affinity mask went offline. It's also a limitation because it's desired to allow manual placement of the general device interrupts for various reasons. If they are marked managed then the interrupt affinity setting from both user and kernel space is disabled. To remedy that situation it's required to convey more information than the cpumasks through various interfaces related to interrupt descriptor allocation. Instead of adding yet another argument, create a new data structure 'irq_affinity_desc' which for now just contains the cpumask. This struct can be expanded to convey auxilliary information in the next step. No functional change, just preparatory work. [ tglx: Simplified logic and clarified changelog ] Suggested-by: Thomas Gleixner <tglx@linutronix.de> Suggested-by: Bjorn Helgaas <bhelgaas@google.com> Signed-off-by: Dou Liyang <douliyangs@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: linux-pci@vger.kernel.org Cc: kashyap.desai@broadcom.com Cc: shivasharan.srikanteshwara@broadcom.com Cc: sumit.saxena@broadcom.com Cc: ming.lei@redhat.com Cc: hch@lst.de Cc: douliyang1@huawei.com Link: https://lkml.kernel.org/r/20181204155122.6327-2-douliyangs@gmail.com
2018-12-04 23:51:20 +08:00
struct irq_affinity_desc *masks)
{
genirq/affinity: Spread vectors on node according to nr_cpu ratio Now __irq_build_affinity_masks() spreads vectors evenly per node, but there is a case that not all vectors have been spread when each numa node has a different number of CPUs which triggers the warning in the spreading code. Improve the spreading algorithm by - assigning vectors according to the ratio of the number of CPUs on a node to the number of remaining CPUs. - running the assignment from smaller nodes to bigger nodes to guarantee that every active node gets allocated at least one vector. This ensures that all vectors are spread out. Asided of that the spread becomes more fair if the nodes have different number of CPUs. For example, on the following machine: CPU(s): 16 On-line CPU(s) list: 0-15 Thread(s) per core: 1 Core(s) per socket: 8 Socket(s): 2 NUMA node(s): 2 ... NUMA node0 CPU(s): 0,1,3,5-9,11,13-15 NUMA node1 CPU(s): 2,4,10,12 When a driver requests to allocate 8 vectors, the following spread results: irq 31, cpu list 2,4 irq 32, cpu list 10,12 irq 33, cpu list 0-1 irq 34, cpu list 3,5 irq 35, cpu list 6-7 irq 36, cpu list 8-9 irq 37, cpu list 11,13 irq 38, cpu list 14-15 So Node 0 has now 6 and Node 1 has 2 vectors assigned. The original algorithm assigned 4 vectors on each node which was unfair versus Node 0. [ tglx: Massaged changelog ] Reported-by: Jon Derrick <jonathan.derrick@intel.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Jon Derrick <jonathan.derrick@intel.com> Link: https://lkml.kernel.org/r/20190816022849.14075-3-ming.lei@redhat.com
2019-08-16 10:28:49 +08:00
unsigned int curvec = startvec, nr_present = 0, nr_others = 0;
cpumask_var_t *node_to_cpumask;
cpumask_var_t nmsk, npresmsk;
int ret = -ENOMEM;
if (!zalloc_cpumask_var(&nmsk, GFP_KERNEL))
return ret;
if (!zalloc_cpumask_var(&npresmsk, GFP_KERNEL))
goto fail_nmsk;
node_to_cpumask = alloc_node_to_cpumask();
if (!node_to_cpumask)
goto fail_npresmsk;
/* Stabilize the cpumasks */
cpus_read_lock();
build_node_to_cpumask(node_to_cpumask);
/* Spread on present CPUs starting from affd->pre_vectors */
genirq/affinity: Spread vectors on node according to nr_cpu ratio Now __irq_build_affinity_masks() spreads vectors evenly per node, but there is a case that not all vectors have been spread when each numa node has a different number of CPUs which triggers the warning in the spreading code. Improve the spreading algorithm by - assigning vectors according to the ratio of the number of CPUs on a node to the number of remaining CPUs. - running the assignment from smaller nodes to bigger nodes to guarantee that every active node gets allocated at least one vector. This ensures that all vectors are spread out. Asided of that the spread becomes more fair if the nodes have different number of CPUs. For example, on the following machine: CPU(s): 16 On-line CPU(s) list: 0-15 Thread(s) per core: 1 Core(s) per socket: 8 Socket(s): 2 NUMA node(s): 2 ... NUMA node0 CPU(s): 0,1,3,5-9,11,13-15 NUMA node1 CPU(s): 2,4,10,12 When a driver requests to allocate 8 vectors, the following spread results: irq 31, cpu list 2,4 irq 32, cpu list 10,12 irq 33, cpu list 0-1 irq 34, cpu list 3,5 irq 35, cpu list 6-7 irq 36, cpu list 8-9 irq 37, cpu list 11,13 irq 38, cpu list 14-15 So Node 0 has now 6 and Node 1 has 2 vectors assigned. The original algorithm assigned 4 vectors on each node which was unfair versus Node 0. [ tglx: Massaged changelog ] Reported-by: Jon Derrick <jonathan.derrick@intel.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Jon Derrick <jonathan.derrick@intel.com> Link: https://lkml.kernel.org/r/20190816022849.14075-3-ming.lei@redhat.com
2019-08-16 10:28:49 +08:00
ret = __irq_build_affinity_masks(curvec, numvecs, firstvec,
node_to_cpumask, cpu_present_mask,
nmsk, masks);
if (ret < 0)
goto fail_build_affinity;
nr_present = ret;
/*
* Spread on non present CPUs starting from the next vector to be
* handled. If the spreading of present CPUs already exhausted the
* vector space, assign the non present CPUs to the already spread
* out vectors.
*/
if (nr_present >= numvecs)
curvec = firstvec;
else
curvec = firstvec + nr_present;
cpumask_andnot(npresmsk, cpu_possible_mask, cpu_present_mask);
genirq/affinity: Spread vectors on node according to nr_cpu ratio Now __irq_build_affinity_masks() spreads vectors evenly per node, but there is a case that not all vectors have been spread when each numa node has a different number of CPUs which triggers the warning in the spreading code. Improve the spreading algorithm by - assigning vectors according to the ratio of the number of CPUs on a node to the number of remaining CPUs. - running the assignment from smaller nodes to bigger nodes to guarantee that every active node gets allocated at least one vector. This ensures that all vectors are spread out. Asided of that the spread becomes more fair if the nodes have different number of CPUs. For example, on the following machine: CPU(s): 16 On-line CPU(s) list: 0-15 Thread(s) per core: 1 Core(s) per socket: 8 Socket(s): 2 NUMA node(s): 2 ... NUMA node0 CPU(s): 0,1,3,5-9,11,13-15 NUMA node1 CPU(s): 2,4,10,12 When a driver requests to allocate 8 vectors, the following spread results: irq 31, cpu list 2,4 irq 32, cpu list 10,12 irq 33, cpu list 0-1 irq 34, cpu list 3,5 irq 35, cpu list 6-7 irq 36, cpu list 8-9 irq 37, cpu list 11,13 irq 38, cpu list 14-15 So Node 0 has now 6 and Node 1 has 2 vectors assigned. The original algorithm assigned 4 vectors on each node which was unfair versus Node 0. [ tglx: Massaged changelog ] Reported-by: Jon Derrick <jonathan.derrick@intel.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Jon Derrick <jonathan.derrick@intel.com> Link: https://lkml.kernel.org/r/20190816022849.14075-3-ming.lei@redhat.com
2019-08-16 10:28:49 +08:00
ret = __irq_build_affinity_masks(curvec, numvecs, firstvec,
node_to_cpumask, npresmsk, nmsk,
masks);
if (ret >= 0)
nr_others = ret;
fail_build_affinity:
cpus_read_unlock();
genirq/affinity: Spread vectors on node according to nr_cpu ratio Now __irq_build_affinity_masks() spreads vectors evenly per node, but there is a case that not all vectors have been spread when each numa node has a different number of CPUs which triggers the warning in the spreading code. Improve the spreading algorithm by - assigning vectors according to the ratio of the number of CPUs on a node to the number of remaining CPUs. - running the assignment from smaller nodes to bigger nodes to guarantee that every active node gets allocated at least one vector. This ensures that all vectors are spread out. Asided of that the spread becomes more fair if the nodes have different number of CPUs. For example, on the following machine: CPU(s): 16 On-line CPU(s) list: 0-15 Thread(s) per core: 1 Core(s) per socket: 8 Socket(s): 2 NUMA node(s): 2 ... NUMA node0 CPU(s): 0,1,3,5-9,11,13-15 NUMA node1 CPU(s): 2,4,10,12 When a driver requests to allocate 8 vectors, the following spread results: irq 31, cpu list 2,4 irq 32, cpu list 10,12 irq 33, cpu list 0-1 irq 34, cpu list 3,5 irq 35, cpu list 6-7 irq 36, cpu list 8-9 irq 37, cpu list 11,13 irq 38, cpu list 14-15 So Node 0 has now 6 and Node 1 has 2 vectors assigned. The original algorithm assigned 4 vectors on each node which was unfair versus Node 0. [ tglx: Massaged changelog ] Reported-by: Jon Derrick <jonathan.derrick@intel.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Jon Derrick <jonathan.derrick@intel.com> Link: https://lkml.kernel.org/r/20190816022849.14075-3-ming.lei@redhat.com
2019-08-16 10:28:49 +08:00
if (ret >= 0)
WARN_ON(nr_present + nr_others < numvecs);
free_node_to_cpumask(node_to_cpumask);
fail_npresmsk:
free_cpumask_var(npresmsk);
fail_nmsk:
free_cpumask_var(nmsk);
genirq/affinity: Spread vectors on node according to nr_cpu ratio Now __irq_build_affinity_masks() spreads vectors evenly per node, but there is a case that not all vectors have been spread when each numa node has a different number of CPUs which triggers the warning in the spreading code. Improve the spreading algorithm by - assigning vectors according to the ratio of the number of CPUs on a node to the number of remaining CPUs. - running the assignment from smaller nodes to bigger nodes to guarantee that every active node gets allocated at least one vector. This ensures that all vectors are spread out. Asided of that the spread becomes more fair if the nodes have different number of CPUs. For example, on the following machine: CPU(s): 16 On-line CPU(s) list: 0-15 Thread(s) per core: 1 Core(s) per socket: 8 Socket(s): 2 NUMA node(s): 2 ... NUMA node0 CPU(s): 0,1,3,5-9,11,13-15 NUMA node1 CPU(s): 2,4,10,12 When a driver requests to allocate 8 vectors, the following spread results: irq 31, cpu list 2,4 irq 32, cpu list 10,12 irq 33, cpu list 0-1 irq 34, cpu list 3,5 irq 35, cpu list 6-7 irq 36, cpu list 8-9 irq 37, cpu list 11,13 irq 38, cpu list 14-15 So Node 0 has now 6 and Node 1 has 2 vectors assigned. The original algorithm assigned 4 vectors on each node which was unfair versus Node 0. [ tglx: Massaged changelog ] Reported-by: Jon Derrick <jonathan.derrick@intel.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Jon Derrick <jonathan.derrick@intel.com> Link: https://lkml.kernel.org/r/20190816022849.14075-3-ming.lei@redhat.com
2019-08-16 10:28:49 +08:00
return ret < 0 ? ret : 0;
}
genirq/affinity: Add new callback for (re)calculating interrupt sets The interrupt affinity spreading mechanism supports to spread out affinities for one or more interrupt sets. A interrupt set contains one or more interrupts. Each set is mapped to a specific functionality of a device, e.g. general I/O queues and read I/O queus of multiqueue block devices. The number of interrupts per set is defined by the driver. It depends on the total number of available interrupts for the device, which is determined by the PCI capabilites and the availability of underlying CPU resources, and the number of queues which the device provides and the driver wants to instantiate. The driver passes initial configuration for the interrupt allocation via a pointer to struct irq_affinity. Right now the allocation mechanism is complex as it requires to have a loop in the driver to determine the maximum number of interrupts which are provided by the PCI capabilities and the underlying CPU resources. This loop would have to be replicated in every driver which wants to utilize this mechanism. That's unwanted code duplication and error prone. In order to move this into generic facilities it is required to have a mechanism, which allows the recalculation of the interrupt sets and their size, in the core code. As the core code does not have any knowledge about the underlying device, a driver specific callback is required in struct irq_affinity, which can be invoked by the core code. The callback gets the number of available interupts as an argument, so the driver can calculate the corresponding number and size of interrupt sets. At the moment the struct irq_affinity pointer which is handed in from the driver and passed through to several core functions is marked 'const', but for the callback to be able to modify the data in the struct it's required to remove the 'const' qualifier. Add the optional callback to struct irq_affinity, which allows drivers to recalculate the number and size of interrupt sets and remove the 'const' qualifier. For simple invocations, which do not supply a callback, a default callback is installed, which just sets nr_sets to 1 and transfers the number of spreadable vectors to the set_size array at index 0. This is for now guarded by a check for nr_sets != 0 to keep the NVME driver working until it is converted to the callback mechanism. To make sure that the driver configuration is correct under all circumstances the callback is invoked even when there are no interrupts for queues left, i.e. the pre/post requirements already exhaust the numner of available interrupts. At the PCI layer irq_create_affinity_masks() has to be invoked even for the case where the legacy interrupt is used. That ensures that the callback is invoked and the device driver can adjust to that situation. [ tglx: Fixed the simple case (no sets required). Moved the sanity check for nr_sets after the invocation of the callback so it catches broken drivers. Fixed the kernel doc comments for struct irq_affinity and de-'This patch'-ed the changelog ] Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Marc Zyngier <marc.zyngier@arm.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Bjorn Helgaas <helgaas@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: linux-block@vger.kernel.org Cc: Sagi Grimberg <sagi@grimberg.me> Cc: linux-nvme@lists.infradead.org Cc: linux-pci@vger.kernel.org Cc: Keith Busch <keith.busch@intel.com> Cc: Sumit Saxena <sumit.saxena@broadcom.com> Cc: Kashyap Desai <kashyap.desai@broadcom.com> Cc: Shivasharan Srikanteshwara <shivasharan.srikanteshwara@broadcom.com> Link: https://lkml.kernel.org/r/20190216172228.512444498@linutronix.de
2019-02-17 01:13:09 +08:00
static void default_calc_sets(struct irq_affinity *affd, unsigned int affvecs)
{
affd->nr_sets = 1;
affd->set_size[0] = affvecs;
}
/**
* irq_create_affinity_masks - Create affinity masks for multiqueue spreading
* @nvecs: The total number of vectors
* @affd: Description of the affinity requirements
*
genirq/core: Introduce struct irq_affinity_desc The interrupt affinity management uses straight cpumask pointers to convey the automatically assigned affinity masks for managed interrupts. The core interrupt descriptor allocation also decides based on the pointer being non NULL whether an interrupt is managed or not. Devices which use managed interrupts usually have two classes of interrupts: - Interrupts for multiple device queues - Interrupts for general device management Currently both classes are treated the same way, i.e. as managed interrupts. The general interrupts get the default affinity mask assigned while the device queue interrupts are spread out over the possible CPUs. Treating the general interrupts as managed is both a limitation and under certain circumstances a bug. Assume the following situation: default_irq_affinity = 4..7 So if CPUs 4-7 are offlined, then the core code will shut down the device management interrupts because the last CPU in their affinity mask went offline. It's also a limitation because it's desired to allow manual placement of the general device interrupts for various reasons. If they are marked managed then the interrupt affinity setting from both user and kernel space is disabled. To remedy that situation it's required to convey more information than the cpumasks through various interfaces related to interrupt descriptor allocation. Instead of adding yet another argument, create a new data structure 'irq_affinity_desc' which for now just contains the cpumask. This struct can be expanded to convey auxilliary information in the next step. No functional change, just preparatory work. [ tglx: Simplified logic and clarified changelog ] Suggested-by: Thomas Gleixner <tglx@linutronix.de> Suggested-by: Bjorn Helgaas <bhelgaas@google.com> Signed-off-by: Dou Liyang <douliyangs@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: linux-pci@vger.kernel.org Cc: kashyap.desai@broadcom.com Cc: shivasharan.srikanteshwara@broadcom.com Cc: sumit.saxena@broadcom.com Cc: ming.lei@redhat.com Cc: hch@lst.de Cc: douliyang1@huawei.com Link: https://lkml.kernel.org/r/20181204155122.6327-2-douliyangs@gmail.com
2018-12-04 23:51:20 +08:00
* Returns the irq_affinity_desc pointer or NULL if allocation failed.
*/
genirq/core: Introduce struct irq_affinity_desc The interrupt affinity management uses straight cpumask pointers to convey the automatically assigned affinity masks for managed interrupts. The core interrupt descriptor allocation also decides based on the pointer being non NULL whether an interrupt is managed or not. Devices which use managed interrupts usually have two classes of interrupts: - Interrupts for multiple device queues - Interrupts for general device management Currently both classes are treated the same way, i.e. as managed interrupts. The general interrupts get the default affinity mask assigned while the device queue interrupts are spread out over the possible CPUs. Treating the general interrupts as managed is both a limitation and under certain circumstances a bug. Assume the following situation: default_irq_affinity = 4..7 So if CPUs 4-7 are offlined, then the core code will shut down the device management interrupts because the last CPU in their affinity mask went offline. It's also a limitation because it's desired to allow manual placement of the general device interrupts for various reasons. If they are marked managed then the interrupt affinity setting from both user and kernel space is disabled. To remedy that situation it's required to convey more information than the cpumasks through various interfaces related to interrupt descriptor allocation. Instead of adding yet another argument, create a new data structure 'irq_affinity_desc' which for now just contains the cpumask. This struct can be expanded to convey auxilliary information in the next step. No functional change, just preparatory work. [ tglx: Simplified logic and clarified changelog ] Suggested-by: Thomas Gleixner <tglx@linutronix.de> Suggested-by: Bjorn Helgaas <bhelgaas@google.com> Signed-off-by: Dou Liyang <douliyangs@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: linux-pci@vger.kernel.org Cc: kashyap.desai@broadcom.com Cc: shivasharan.srikanteshwara@broadcom.com Cc: sumit.saxena@broadcom.com Cc: ming.lei@redhat.com Cc: hch@lst.de Cc: douliyang1@huawei.com Link: https://lkml.kernel.org/r/20181204155122.6327-2-douliyangs@gmail.com
2018-12-04 23:51:20 +08:00
struct irq_affinity_desc *
genirq/affinity: Store interrupt sets size in struct irq_affinity The interrupt affinity spreading mechanism supports to spread out affinities for one or more interrupt sets. A interrupt set contains one or more interrupts. Each set is mapped to a specific functionality of a device, e.g. general I/O queues and read I/O queus of multiqueue block devices. The number of interrupts per set is defined by the driver. It depends on the total number of available interrupts for the device, which is determined by the PCI capabilites and the availability of underlying CPU resources, and the number of queues which the device provides and the driver wants to instantiate. The driver passes initial configuration for the interrupt allocation via a pointer to struct irq_affinity. Right now the allocation mechanism is complex as it requires to have a loop in the driver to determine the maximum number of interrupts which are provided by the PCI capabilities and the underlying CPU resources. This loop would have to be replicated in every driver which wants to utilize this mechanism. That's unwanted code duplication and error prone. In order to move this into generic facilities it is required to have a mechanism, which allows the recalculation of the interrupt sets and their size, in the core code. As the core code does not have any knowledge about the underlying device, a driver specific callback will be added to struct affinity_desc, which will be invoked by the core code. The callback will get the number of available interupts as an argument, so the driver can calculate the corresponding number and size of interrupt sets. To support this, two modifications for the handling of struct irq_affinity are required: 1) The (optional) interrupt sets size information is contained in a separate array of integers and struct irq_affinity contains a pointer to it. This is cumbersome and as the maximum number of interrupt sets is small, there is no reason to have separate storage. Moving the size array into struct affinity_desc avoids indirections and makes the code simpler. 2) At the moment the struct irq_affinity pointer which is handed in from the driver and passed through to several core functions is marked 'const'. With the upcoming callback to recalculate the number and size of interrupt sets, it's necessary to remove the 'const' qualifier. Otherwise the callback would not be able to update the data. Implement #1 and store the interrupt sets size in 'struct irq_affinity'. No functional change. [ tglx: Fixed the memcpy() size so it won't copy beyond the size of the source. Fixed the kernel doc comments for struct irq_affinity and de-'This patch'-ed the changelog ] Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Marc Zyngier <marc.zyngier@arm.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Bjorn Helgaas <helgaas@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: linux-block@vger.kernel.org Cc: Sagi Grimberg <sagi@grimberg.me> Cc: linux-nvme@lists.infradead.org Cc: linux-pci@vger.kernel.org Cc: Keith Busch <keith.busch@intel.com> Cc: Sumit Saxena <sumit.saxena@broadcom.com> Cc: Kashyap Desai <kashyap.desai@broadcom.com> Cc: Shivasharan Srikanteshwara <shivasharan.srikanteshwara@broadcom.com> Link: https://lkml.kernel.org/r/20190216172228.423723127@linutronix.de
2019-02-17 01:13:08 +08:00
irq_create_affinity_masks(unsigned int nvecs, struct irq_affinity *affd)
{
genirq/affinity: Add new callback for (re)calculating interrupt sets The interrupt affinity spreading mechanism supports to spread out affinities for one or more interrupt sets. A interrupt set contains one or more interrupts. Each set is mapped to a specific functionality of a device, e.g. general I/O queues and read I/O queus of multiqueue block devices. The number of interrupts per set is defined by the driver. It depends on the total number of available interrupts for the device, which is determined by the PCI capabilites and the availability of underlying CPU resources, and the number of queues which the device provides and the driver wants to instantiate. The driver passes initial configuration for the interrupt allocation via a pointer to struct irq_affinity. Right now the allocation mechanism is complex as it requires to have a loop in the driver to determine the maximum number of interrupts which are provided by the PCI capabilities and the underlying CPU resources. This loop would have to be replicated in every driver which wants to utilize this mechanism. That's unwanted code duplication and error prone. In order to move this into generic facilities it is required to have a mechanism, which allows the recalculation of the interrupt sets and their size, in the core code. As the core code does not have any knowledge about the underlying device, a driver specific callback is required in struct irq_affinity, which can be invoked by the core code. The callback gets the number of available interupts as an argument, so the driver can calculate the corresponding number and size of interrupt sets. At the moment the struct irq_affinity pointer which is handed in from the driver and passed through to several core functions is marked 'const', but for the callback to be able to modify the data in the struct it's required to remove the 'const' qualifier. Add the optional callback to struct irq_affinity, which allows drivers to recalculate the number and size of interrupt sets and remove the 'const' qualifier. For simple invocations, which do not supply a callback, a default callback is installed, which just sets nr_sets to 1 and transfers the number of spreadable vectors to the set_size array at index 0. This is for now guarded by a check for nr_sets != 0 to keep the NVME driver working until it is converted to the callback mechanism. To make sure that the driver configuration is correct under all circumstances the callback is invoked even when there are no interrupts for queues left, i.e. the pre/post requirements already exhaust the numner of available interrupts. At the PCI layer irq_create_affinity_masks() has to be invoked even for the case where the legacy interrupt is used. That ensures that the callback is invoked and the device driver can adjust to that situation. [ tglx: Fixed the simple case (no sets required). Moved the sanity check for nr_sets after the invocation of the callback so it catches broken drivers. Fixed the kernel doc comments for struct irq_affinity and de-'This patch'-ed the changelog ] Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Marc Zyngier <marc.zyngier@arm.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Bjorn Helgaas <helgaas@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: linux-block@vger.kernel.org Cc: Sagi Grimberg <sagi@grimberg.me> Cc: linux-nvme@lists.infradead.org Cc: linux-pci@vger.kernel.org Cc: Keith Busch <keith.busch@intel.com> Cc: Sumit Saxena <sumit.saxena@broadcom.com> Cc: Kashyap Desai <kashyap.desai@broadcom.com> Cc: Shivasharan Srikanteshwara <shivasharan.srikanteshwara@broadcom.com> Link: https://lkml.kernel.org/r/20190216172228.512444498@linutronix.de
2019-02-17 01:13:09 +08:00
unsigned int affvecs, curvec, usedvecs, i;
genirq/core: Introduce struct irq_affinity_desc The interrupt affinity management uses straight cpumask pointers to convey the automatically assigned affinity masks for managed interrupts. The core interrupt descriptor allocation also decides based on the pointer being non NULL whether an interrupt is managed or not. Devices which use managed interrupts usually have two classes of interrupts: - Interrupts for multiple device queues - Interrupts for general device management Currently both classes are treated the same way, i.e. as managed interrupts. The general interrupts get the default affinity mask assigned while the device queue interrupts are spread out over the possible CPUs. Treating the general interrupts as managed is both a limitation and under certain circumstances a bug. Assume the following situation: default_irq_affinity = 4..7 So if CPUs 4-7 are offlined, then the core code will shut down the device management interrupts because the last CPU in their affinity mask went offline. It's also a limitation because it's desired to allow manual placement of the general device interrupts for various reasons. If they are marked managed then the interrupt affinity setting from both user and kernel space is disabled. To remedy that situation it's required to convey more information than the cpumasks through various interfaces related to interrupt descriptor allocation. Instead of adding yet another argument, create a new data structure 'irq_affinity_desc' which for now just contains the cpumask. This struct can be expanded to convey auxilliary information in the next step. No functional change, just preparatory work. [ tglx: Simplified logic and clarified changelog ] Suggested-by: Thomas Gleixner <tglx@linutronix.de> Suggested-by: Bjorn Helgaas <bhelgaas@google.com> Signed-off-by: Dou Liyang <douliyangs@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: linux-pci@vger.kernel.org Cc: kashyap.desai@broadcom.com Cc: shivasharan.srikanteshwara@broadcom.com Cc: sumit.saxena@broadcom.com Cc: ming.lei@redhat.com Cc: hch@lst.de Cc: douliyang1@huawei.com Link: https://lkml.kernel.org/r/20181204155122.6327-2-douliyangs@gmail.com
2018-12-04 23:51:20 +08:00
struct irq_affinity_desc *masks = NULL;
/*
genirq/affinity: Add new callback for (re)calculating interrupt sets The interrupt affinity spreading mechanism supports to spread out affinities for one or more interrupt sets. A interrupt set contains one or more interrupts. Each set is mapped to a specific functionality of a device, e.g. general I/O queues and read I/O queus of multiqueue block devices. The number of interrupts per set is defined by the driver. It depends on the total number of available interrupts for the device, which is determined by the PCI capabilites and the availability of underlying CPU resources, and the number of queues which the device provides and the driver wants to instantiate. The driver passes initial configuration for the interrupt allocation via a pointer to struct irq_affinity. Right now the allocation mechanism is complex as it requires to have a loop in the driver to determine the maximum number of interrupts which are provided by the PCI capabilities and the underlying CPU resources. This loop would have to be replicated in every driver which wants to utilize this mechanism. That's unwanted code duplication and error prone. In order to move this into generic facilities it is required to have a mechanism, which allows the recalculation of the interrupt sets and their size, in the core code. As the core code does not have any knowledge about the underlying device, a driver specific callback is required in struct irq_affinity, which can be invoked by the core code. The callback gets the number of available interupts as an argument, so the driver can calculate the corresponding number and size of interrupt sets. At the moment the struct irq_affinity pointer which is handed in from the driver and passed through to several core functions is marked 'const', but for the callback to be able to modify the data in the struct it's required to remove the 'const' qualifier. Add the optional callback to struct irq_affinity, which allows drivers to recalculate the number and size of interrupt sets and remove the 'const' qualifier. For simple invocations, which do not supply a callback, a default callback is installed, which just sets nr_sets to 1 and transfers the number of spreadable vectors to the set_size array at index 0. This is for now guarded by a check for nr_sets != 0 to keep the NVME driver working until it is converted to the callback mechanism. To make sure that the driver configuration is correct under all circumstances the callback is invoked even when there are no interrupts for queues left, i.e. the pre/post requirements already exhaust the numner of available interrupts. At the PCI layer irq_create_affinity_masks() has to be invoked even for the case where the legacy interrupt is used. That ensures that the callback is invoked and the device driver can adjust to that situation. [ tglx: Fixed the simple case (no sets required). Moved the sanity check for nr_sets after the invocation of the callback so it catches broken drivers. Fixed the kernel doc comments for struct irq_affinity and de-'This patch'-ed the changelog ] Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Marc Zyngier <marc.zyngier@arm.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Bjorn Helgaas <helgaas@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: linux-block@vger.kernel.org Cc: Sagi Grimberg <sagi@grimberg.me> Cc: linux-nvme@lists.infradead.org Cc: linux-pci@vger.kernel.org Cc: Keith Busch <keith.busch@intel.com> Cc: Sumit Saxena <sumit.saxena@broadcom.com> Cc: Kashyap Desai <kashyap.desai@broadcom.com> Cc: Shivasharan Srikanteshwara <shivasharan.srikanteshwara@broadcom.com> Link: https://lkml.kernel.org/r/20190216172228.512444498@linutronix.de
2019-02-17 01:13:09 +08:00
* Determine the number of vectors which need interrupt affinities
* assigned. If the pre/post request exhausts the available vectors
* then nothing to do here except for invoking the calc_sets()
* callback so the device driver can adjust to the situation.
*/
if (nvecs > affd->pre_vectors + affd->post_vectors)
genirq/affinity: Add new callback for (re)calculating interrupt sets The interrupt affinity spreading mechanism supports to spread out affinities for one or more interrupt sets. A interrupt set contains one or more interrupts. Each set is mapped to a specific functionality of a device, e.g. general I/O queues and read I/O queus of multiqueue block devices. The number of interrupts per set is defined by the driver. It depends on the total number of available interrupts for the device, which is determined by the PCI capabilites and the availability of underlying CPU resources, and the number of queues which the device provides and the driver wants to instantiate. The driver passes initial configuration for the interrupt allocation via a pointer to struct irq_affinity. Right now the allocation mechanism is complex as it requires to have a loop in the driver to determine the maximum number of interrupts which are provided by the PCI capabilities and the underlying CPU resources. This loop would have to be replicated in every driver which wants to utilize this mechanism. That's unwanted code duplication and error prone. In order to move this into generic facilities it is required to have a mechanism, which allows the recalculation of the interrupt sets and their size, in the core code. As the core code does not have any knowledge about the underlying device, a driver specific callback is required in struct irq_affinity, which can be invoked by the core code. The callback gets the number of available interupts as an argument, so the driver can calculate the corresponding number and size of interrupt sets. At the moment the struct irq_affinity pointer which is handed in from the driver and passed through to several core functions is marked 'const', but for the callback to be able to modify the data in the struct it's required to remove the 'const' qualifier. Add the optional callback to struct irq_affinity, which allows drivers to recalculate the number and size of interrupt sets and remove the 'const' qualifier. For simple invocations, which do not supply a callback, a default callback is installed, which just sets nr_sets to 1 and transfers the number of spreadable vectors to the set_size array at index 0. This is for now guarded by a check for nr_sets != 0 to keep the NVME driver working until it is converted to the callback mechanism. To make sure that the driver configuration is correct under all circumstances the callback is invoked even when there are no interrupts for queues left, i.e. the pre/post requirements already exhaust the numner of available interrupts. At the PCI layer irq_create_affinity_masks() has to be invoked even for the case where the legacy interrupt is used. That ensures that the callback is invoked and the device driver can adjust to that situation. [ tglx: Fixed the simple case (no sets required). Moved the sanity check for nr_sets after the invocation of the callback so it catches broken drivers. Fixed the kernel doc comments for struct irq_affinity and de-'This patch'-ed the changelog ] Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Marc Zyngier <marc.zyngier@arm.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Bjorn Helgaas <helgaas@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: linux-block@vger.kernel.org Cc: Sagi Grimberg <sagi@grimberg.me> Cc: linux-nvme@lists.infradead.org Cc: linux-pci@vger.kernel.org Cc: Keith Busch <keith.busch@intel.com> Cc: Sumit Saxena <sumit.saxena@broadcom.com> Cc: Kashyap Desai <kashyap.desai@broadcom.com> Cc: Shivasharan Srikanteshwara <shivasharan.srikanteshwara@broadcom.com> Link: https://lkml.kernel.org/r/20190216172228.512444498@linutronix.de
2019-02-17 01:13:09 +08:00
affvecs = nvecs - affd->pre_vectors - affd->post_vectors;
else
affvecs = 0;
/*
* Simple invocations do not provide a calc_sets() callback. Install
* the generic one.
genirq/affinity: Add new callback for (re)calculating interrupt sets The interrupt affinity spreading mechanism supports to spread out affinities for one or more interrupt sets. A interrupt set contains one or more interrupts. Each set is mapped to a specific functionality of a device, e.g. general I/O queues and read I/O queus of multiqueue block devices. The number of interrupts per set is defined by the driver. It depends on the total number of available interrupts for the device, which is determined by the PCI capabilites and the availability of underlying CPU resources, and the number of queues which the device provides and the driver wants to instantiate. The driver passes initial configuration for the interrupt allocation via a pointer to struct irq_affinity. Right now the allocation mechanism is complex as it requires to have a loop in the driver to determine the maximum number of interrupts which are provided by the PCI capabilities and the underlying CPU resources. This loop would have to be replicated in every driver which wants to utilize this mechanism. That's unwanted code duplication and error prone. In order to move this into generic facilities it is required to have a mechanism, which allows the recalculation of the interrupt sets and their size, in the core code. As the core code does not have any knowledge about the underlying device, a driver specific callback is required in struct irq_affinity, which can be invoked by the core code. The callback gets the number of available interupts as an argument, so the driver can calculate the corresponding number and size of interrupt sets. At the moment the struct irq_affinity pointer which is handed in from the driver and passed through to several core functions is marked 'const', but for the callback to be able to modify the data in the struct it's required to remove the 'const' qualifier. Add the optional callback to struct irq_affinity, which allows drivers to recalculate the number and size of interrupt sets and remove the 'const' qualifier. For simple invocations, which do not supply a callback, a default callback is installed, which just sets nr_sets to 1 and transfers the number of spreadable vectors to the set_size array at index 0. This is for now guarded by a check for nr_sets != 0 to keep the NVME driver working until it is converted to the callback mechanism. To make sure that the driver configuration is correct under all circumstances the callback is invoked even when there are no interrupts for queues left, i.e. the pre/post requirements already exhaust the numner of available interrupts. At the PCI layer irq_create_affinity_masks() has to be invoked even for the case where the legacy interrupt is used. That ensures that the callback is invoked and the device driver can adjust to that situation. [ tglx: Fixed the simple case (no sets required). Moved the sanity check for nr_sets after the invocation of the callback so it catches broken drivers. Fixed the kernel doc comments for struct irq_affinity and de-'This patch'-ed the changelog ] Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Marc Zyngier <marc.zyngier@arm.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Bjorn Helgaas <helgaas@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: linux-block@vger.kernel.org Cc: Sagi Grimberg <sagi@grimberg.me> Cc: linux-nvme@lists.infradead.org Cc: linux-pci@vger.kernel.org Cc: Keith Busch <keith.busch@intel.com> Cc: Sumit Saxena <sumit.saxena@broadcom.com> Cc: Kashyap Desai <kashyap.desai@broadcom.com> Cc: Shivasharan Srikanteshwara <shivasharan.srikanteshwara@broadcom.com> Link: https://lkml.kernel.org/r/20190216172228.512444498@linutronix.de
2019-02-17 01:13:09 +08:00
*/
if (!affd->calc_sets)
genirq/affinity: Add new callback for (re)calculating interrupt sets The interrupt affinity spreading mechanism supports to spread out affinities for one or more interrupt sets. A interrupt set contains one or more interrupts. Each set is mapped to a specific functionality of a device, e.g. general I/O queues and read I/O queus of multiqueue block devices. The number of interrupts per set is defined by the driver. It depends on the total number of available interrupts for the device, which is determined by the PCI capabilites and the availability of underlying CPU resources, and the number of queues which the device provides and the driver wants to instantiate. The driver passes initial configuration for the interrupt allocation via a pointer to struct irq_affinity. Right now the allocation mechanism is complex as it requires to have a loop in the driver to determine the maximum number of interrupts which are provided by the PCI capabilities and the underlying CPU resources. This loop would have to be replicated in every driver which wants to utilize this mechanism. That's unwanted code duplication and error prone. In order to move this into generic facilities it is required to have a mechanism, which allows the recalculation of the interrupt sets and their size, in the core code. As the core code does not have any knowledge about the underlying device, a driver specific callback is required in struct irq_affinity, which can be invoked by the core code. The callback gets the number of available interupts as an argument, so the driver can calculate the corresponding number and size of interrupt sets. At the moment the struct irq_affinity pointer which is handed in from the driver and passed through to several core functions is marked 'const', but for the callback to be able to modify the data in the struct it's required to remove the 'const' qualifier. Add the optional callback to struct irq_affinity, which allows drivers to recalculate the number and size of interrupt sets and remove the 'const' qualifier. For simple invocations, which do not supply a callback, a default callback is installed, which just sets nr_sets to 1 and transfers the number of spreadable vectors to the set_size array at index 0. This is for now guarded by a check for nr_sets != 0 to keep the NVME driver working until it is converted to the callback mechanism. To make sure that the driver configuration is correct under all circumstances the callback is invoked even when there are no interrupts for queues left, i.e. the pre/post requirements already exhaust the numner of available interrupts. At the PCI layer irq_create_affinity_masks() has to be invoked even for the case where the legacy interrupt is used. That ensures that the callback is invoked and the device driver can adjust to that situation. [ tglx: Fixed the simple case (no sets required). Moved the sanity check for nr_sets after the invocation of the callback so it catches broken drivers. Fixed the kernel doc comments for struct irq_affinity and de-'This patch'-ed the changelog ] Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Marc Zyngier <marc.zyngier@arm.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Bjorn Helgaas <helgaas@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: linux-block@vger.kernel.org Cc: Sagi Grimberg <sagi@grimberg.me> Cc: linux-nvme@lists.infradead.org Cc: linux-pci@vger.kernel.org Cc: Keith Busch <keith.busch@intel.com> Cc: Sumit Saxena <sumit.saxena@broadcom.com> Cc: Kashyap Desai <kashyap.desai@broadcom.com> Cc: Shivasharan Srikanteshwara <shivasharan.srikanteshwara@broadcom.com> Link: https://lkml.kernel.org/r/20190216172228.512444498@linutronix.de
2019-02-17 01:13:09 +08:00
affd->calc_sets = default_calc_sets;
/* Recalculate the sets */
affd->calc_sets(affd, affvecs);
genirq/affinity: Store interrupt sets size in struct irq_affinity The interrupt affinity spreading mechanism supports to spread out affinities for one or more interrupt sets. A interrupt set contains one or more interrupts. Each set is mapped to a specific functionality of a device, e.g. general I/O queues and read I/O queus of multiqueue block devices. The number of interrupts per set is defined by the driver. It depends on the total number of available interrupts for the device, which is determined by the PCI capabilites and the availability of underlying CPU resources, and the number of queues which the device provides and the driver wants to instantiate. The driver passes initial configuration for the interrupt allocation via a pointer to struct irq_affinity. Right now the allocation mechanism is complex as it requires to have a loop in the driver to determine the maximum number of interrupts which are provided by the PCI capabilities and the underlying CPU resources. This loop would have to be replicated in every driver which wants to utilize this mechanism. That's unwanted code duplication and error prone. In order to move this into generic facilities it is required to have a mechanism, which allows the recalculation of the interrupt sets and their size, in the core code. As the core code does not have any knowledge about the underlying device, a driver specific callback will be added to struct affinity_desc, which will be invoked by the core code. The callback will get the number of available interupts as an argument, so the driver can calculate the corresponding number and size of interrupt sets. To support this, two modifications for the handling of struct irq_affinity are required: 1) The (optional) interrupt sets size information is contained in a separate array of integers and struct irq_affinity contains a pointer to it. This is cumbersome and as the maximum number of interrupt sets is small, there is no reason to have separate storage. Moving the size array into struct affinity_desc avoids indirections and makes the code simpler. 2) At the moment the struct irq_affinity pointer which is handed in from the driver and passed through to several core functions is marked 'const'. With the upcoming callback to recalculate the number and size of interrupt sets, it's necessary to remove the 'const' qualifier. Otherwise the callback would not be able to update the data. Implement #1 and store the interrupt sets size in 'struct irq_affinity'. No functional change. [ tglx: Fixed the memcpy() size so it won't copy beyond the size of the source. Fixed the kernel doc comments for struct irq_affinity and de-'This patch'-ed the changelog ] Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Marc Zyngier <marc.zyngier@arm.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Bjorn Helgaas <helgaas@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: linux-block@vger.kernel.org Cc: Sagi Grimberg <sagi@grimberg.me> Cc: linux-nvme@lists.infradead.org Cc: linux-pci@vger.kernel.org Cc: Keith Busch <keith.busch@intel.com> Cc: Sumit Saxena <sumit.saxena@broadcom.com> Cc: Kashyap Desai <kashyap.desai@broadcom.com> Cc: Shivasharan Srikanteshwara <shivasharan.srikanteshwara@broadcom.com> Link: https://lkml.kernel.org/r/20190216172228.423723127@linutronix.de
2019-02-17 01:13:08 +08:00
if (WARN_ON_ONCE(affd->nr_sets > IRQ_AFFINITY_MAX_SETS))
return NULL;
genirq/affinity: Add new callback for (re)calculating interrupt sets The interrupt affinity spreading mechanism supports to spread out affinities for one or more interrupt sets. A interrupt set contains one or more interrupts. Each set is mapped to a specific functionality of a device, e.g. general I/O queues and read I/O queus of multiqueue block devices. The number of interrupts per set is defined by the driver. It depends on the total number of available interrupts for the device, which is determined by the PCI capabilites and the availability of underlying CPU resources, and the number of queues which the device provides and the driver wants to instantiate. The driver passes initial configuration for the interrupt allocation via a pointer to struct irq_affinity. Right now the allocation mechanism is complex as it requires to have a loop in the driver to determine the maximum number of interrupts which are provided by the PCI capabilities and the underlying CPU resources. This loop would have to be replicated in every driver which wants to utilize this mechanism. That's unwanted code duplication and error prone. In order to move this into generic facilities it is required to have a mechanism, which allows the recalculation of the interrupt sets and their size, in the core code. As the core code does not have any knowledge about the underlying device, a driver specific callback is required in struct irq_affinity, which can be invoked by the core code. The callback gets the number of available interupts as an argument, so the driver can calculate the corresponding number and size of interrupt sets. At the moment the struct irq_affinity pointer which is handed in from the driver and passed through to several core functions is marked 'const', but for the callback to be able to modify the data in the struct it's required to remove the 'const' qualifier. Add the optional callback to struct irq_affinity, which allows drivers to recalculate the number and size of interrupt sets and remove the 'const' qualifier. For simple invocations, which do not supply a callback, a default callback is installed, which just sets nr_sets to 1 and transfers the number of spreadable vectors to the set_size array at index 0. This is for now guarded by a check for nr_sets != 0 to keep the NVME driver working until it is converted to the callback mechanism. To make sure that the driver configuration is correct under all circumstances the callback is invoked even when there are no interrupts for queues left, i.e. the pre/post requirements already exhaust the numner of available interrupts. At the PCI layer irq_create_affinity_masks() has to be invoked even for the case where the legacy interrupt is used. That ensures that the callback is invoked and the device driver can adjust to that situation. [ tglx: Fixed the simple case (no sets required). Moved the sanity check for nr_sets after the invocation of the callback so it catches broken drivers. Fixed the kernel doc comments for struct irq_affinity and de-'This patch'-ed the changelog ] Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Marc Zyngier <marc.zyngier@arm.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Bjorn Helgaas <helgaas@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: linux-block@vger.kernel.org Cc: Sagi Grimberg <sagi@grimberg.me> Cc: linux-nvme@lists.infradead.org Cc: linux-pci@vger.kernel.org Cc: Keith Busch <keith.busch@intel.com> Cc: Sumit Saxena <sumit.saxena@broadcom.com> Cc: Kashyap Desai <kashyap.desai@broadcom.com> Cc: Shivasharan Srikanteshwara <shivasharan.srikanteshwara@broadcom.com> Link: https://lkml.kernel.org/r/20190216172228.512444498@linutronix.de
2019-02-17 01:13:09 +08:00
/* Nothing to assign? */
if (!affvecs)
return NULL;
masks = kcalloc(nvecs, sizeof(*masks), GFP_KERNEL);
if (!masks)
return NULL;
/* Fill out vectors at the beginning that don't need affinity */
for (curvec = 0; curvec < affd->pre_vectors; curvec++)
genirq/core: Introduce struct irq_affinity_desc The interrupt affinity management uses straight cpumask pointers to convey the automatically assigned affinity masks for managed interrupts. The core interrupt descriptor allocation also decides based on the pointer being non NULL whether an interrupt is managed or not. Devices which use managed interrupts usually have two classes of interrupts: - Interrupts for multiple device queues - Interrupts for general device management Currently both classes are treated the same way, i.e. as managed interrupts. The general interrupts get the default affinity mask assigned while the device queue interrupts are spread out over the possible CPUs. Treating the general interrupts as managed is both a limitation and under certain circumstances a bug. Assume the following situation: default_irq_affinity = 4..7 So if CPUs 4-7 are offlined, then the core code will shut down the device management interrupts because the last CPU in their affinity mask went offline. It's also a limitation because it's desired to allow manual placement of the general device interrupts for various reasons. If they are marked managed then the interrupt affinity setting from both user and kernel space is disabled. To remedy that situation it's required to convey more information than the cpumasks through various interfaces related to interrupt descriptor allocation. Instead of adding yet another argument, create a new data structure 'irq_affinity_desc' which for now just contains the cpumask. This struct can be expanded to convey auxilliary information in the next step. No functional change, just preparatory work. [ tglx: Simplified logic and clarified changelog ] Suggested-by: Thomas Gleixner <tglx@linutronix.de> Suggested-by: Bjorn Helgaas <bhelgaas@google.com> Signed-off-by: Dou Liyang <douliyangs@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: linux-pci@vger.kernel.org Cc: kashyap.desai@broadcom.com Cc: shivasharan.srikanteshwara@broadcom.com Cc: sumit.saxena@broadcom.com Cc: ming.lei@redhat.com Cc: hch@lst.de Cc: douliyang1@huawei.com Link: https://lkml.kernel.org/r/20181204155122.6327-2-douliyangs@gmail.com
2018-12-04 23:51:20 +08:00
cpumask_copy(&masks[curvec].mask, irq_default_affinity);
genirq/affinity: Add new callback for (re)calculating interrupt sets The interrupt affinity spreading mechanism supports to spread out affinities for one or more interrupt sets. A interrupt set contains one or more interrupts. Each set is mapped to a specific functionality of a device, e.g. general I/O queues and read I/O queus of multiqueue block devices. The number of interrupts per set is defined by the driver. It depends on the total number of available interrupts for the device, which is determined by the PCI capabilites and the availability of underlying CPU resources, and the number of queues which the device provides and the driver wants to instantiate. The driver passes initial configuration for the interrupt allocation via a pointer to struct irq_affinity. Right now the allocation mechanism is complex as it requires to have a loop in the driver to determine the maximum number of interrupts which are provided by the PCI capabilities and the underlying CPU resources. This loop would have to be replicated in every driver which wants to utilize this mechanism. That's unwanted code duplication and error prone. In order to move this into generic facilities it is required to have a mechanism, which allows the recalculation of the interrupt sets and their size, in the core code. As the core code does not have any knowledge about the underlying device, a driver specific callback is required in struct irq_affinity, which can be invoked by the core code. The callback gets the number of available interupts as an argument, so the driver can calculate the corresponding number and size of interrupt sets. At the moment the struct irq_affinity pointer which is handed in from the driver and passed through to several core functions is marked 'const', but for the callback to be able to modify the data in the struct it's required to remove the 'const' qualifier. Add the optional callback to struct irq_affinity, which allows drivers to recalculate the number and size of interrupt sets and remove the 'const' qualifier. For simple invocations, which do not supply a callback, a default callback is installed, which just sets nr_sets to 1 and transfers the number of spreadable vectors to the set_size array at index 0. This is for now guarded by a check for nr_sets != 0 to keep the NVME driver working until it is converted to the callback mechanism. To make sure that the driver configuration is correct under all circumstances the callback is invoked even when there are no interrupts for queues left, i.e. the pre/post requirements already exhaust the numner of available interrupts. At the PCI layer irq_create_affinity_masks() has to be invoked even for the case where the legacy interrupt is used. That ensures that the callback is invoked and the device driver can adjust to that situation. [ tglx: Fixed the simple case (no sets required). Moved the sanity check for nr_sets after the invocation of the callback so it catches broken drivers. Fixed the kernel doc comments for struct irq_affinity and de-'This patch'-ed the changelog ] Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Marc Zyngier <marc.zyngier@arm.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Bjorn Helgaas <helgaas@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: linux-block@vger.kernel.org Cc: Sagi Grimberg <sagi@grimberg.me> Cc: linux-nvme@lists.infradead.org Cc: linux-pci@vger.kernel.org Cc: Keith Busch <keith.busch@intel.com> Cc: Sumit Saxena <sumit.saxena@broadcom.com> Cc: Kashyap Desai <kashyap.desai@broadcom.com> Cc: Shivasharan Srikanteshwara <shivasharan.srikanteshwara@broadcom.com> Link: https://lkml.kernel.org/r/20190216172228.512444498@linutronix.de
2019-02-17 01:13:09 +08:00
/*
* Spread on present CPUs starting from affd->pre_vectors. If we
* have multiple sets, build each sets affinity mask separately.
*/
genirq/affinity: Add new callback for (re)calculating interrupt sets The interrupt affinity spreading mechanism supports to spread out affinities for one or more interrupt sets. A interrupt set contains one or more interrupts. Each set is mapped to a specific functionality of a device, e.g. general I/O queues and read I/O queus of multiqueue block devices. The number of interrupts per set is defined by the driver. It depends on the total number of available interrupts for the device, which is determined by the PCI capabilites and the availability of underlying CPU resources, and the number of queues which the device provides and the driver wants to instantiate. The driver passes initial configuration for the interrupt allocation via a pointer to struct irq_affinity. Right now the allocation mechanism is complex as it requires to have a loop in the driver to determine the maximum number of interrupts which are provided by the PCI capabilities and the underlying CPU resources. This loop would have to be replicated in every driver which wants to utilize this mechanism. That's unwanted code duplication and error prone. In order to move this into generic facilities it is required to have a mechanism, which allows the recalculation of the interrupt sets and their size, in the core code. As the core code does not have any knowledge about the underlying device, a driver specific callback is required in struct irq_affinity, which can be invoked by the core code. The callback gets the number of available interupts as an argument, so the driver can calculate the corresponding number and size of interrupt sets. At the moment the struct irq_affinity pointer which is handed in from the driver and passed through to several core functions is marked 'const', but for the callback to be able to modify the data in the struct it's required to remove the 'const' qualifier. Add the optional callback to struct irq_affinity, which allows drivers to recalculate the number and size of interrupt sets and remove the 'const' qualifier. For simple invocations, which do not supply a callback, a default callback is installed, which just sets nr_sets to 1 and transfers the number of spreadable vectors to the set_size array at index 0. This is for now guarded by a check for nr_sets != 0 to keep the NVME driver working until it is converted to the callback mechanism. To make sure that the driver configuration is correct under all circumstances the callback is invoked even when there are no interrupts for queues left, i.e. the pre/post requirements already exhaust the numner of available interrupts. At the PCI layer irq_create_affinity_masks() has to be invoked even for the case where the legacy interrupt is used. That ensures that the callback is invoked and the device driver can adjust to that situation. [ tglx: Fixed the simple case (no sets required). Moved the sanity check for nr_sets after the invocation of the callback so it catches broken drivers. Fixed the kernel doc comments for struct irq_affinity and de-'This patch'-ed the changelog ] Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Marc Zyngier <marc.zyngier@arm.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Bjorn Helgaas <helgaas@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: linux-block@vger.kernel.org Cc: Sagi Grimberg <sagi@grimberg.me> Cc: linux-nvme@lists.infradead.org Cc: linux-pci@vger.kernel.org Cc: Keith Busch <keith.busch@intel.com> Cc: Sumit Saxena <sumit.saxena@broadcom.com> Cc: Kashyap Desai <kashyap.desai@broadcom.com> Cc: Shivasharan Srikanteshwara <shivasharan.srikanteshwara@broadcom.com> Link: https://lkml.kernel.org/r/20190216172228.512444498@linutronix.de
2019-02-17 01:13:09 +08:00
for (i = 0, usedvecs = 0; i < affd->nr_sets; i++) {
unsigned int this_vecs = affd->set_size[i];
int ret;
ret = irq_build_affinity_masks(curvec, this_vecs,
curvec, masks);
if (ret) {
kfree(masks);
return NULL;
}
curvec += this_vecs;
usedvecs += this_vecs;
}
/* Fill out vectors at the end that don't need affinity */
genirq/affinity: Spread irq vectors among present CPUs as far as possible Commit 84676c1f21 ("genirq/affinity: assign vectors to all possible CPUs") tried to spread the interrupts accross all possible CPUs to make sure that in case of phsyical hotplug (e.g. virtualization) the CPUs which get plugged in after the device was initialized are targeted by a hardware queue and the corresponding interrupt. This has a downside in cases where the ACPI tables claim that there are more possible CPUs than present CPUs and the number of interrupts to spread out is smaller than the number of possible CPUs. These bogus ACPI tables are unfortunately not uncommon. In such a case the vector spreading algorithm assigns interrupts to CPUs which can never be utilized and as a consequence these interrupts are unused instead of being mapped to present CPUs. As a result the performance of the device is suboptimal. To fix this spread the interrupt vectors in two stages: 1) Spread as many interrupts as possible among the present CPUs 2) Spread the remaining vectors among non present CPUs On a 8 core system, where CPU 0-3 are present and CPU 4-7 are not present, for a device with 4 queues the resulting interrupt affinity is: 1) Before 84676c1f21 ("genirq/affinity: assign vectors to all possible CPUs") irq 39, cpu list 0 irq 40, cpu list 1 irq 41, cpu list 2 irq 42, cpu list 3 2) With 84676c1f21 ("genirq/affinity: assign vectors to all possible CPUs") irq 39, cpu list 0-2 irq 40, cpu list 3-4,6 irq 41, cpu list 5 irq 42, cpu list 7 3) With the refined vector spread applied: irq 39, cpu list 0,4 irq 40, cpu list 1,6 irq 41, cpu list 2,5 irq 42, cpu list 3,7 On a 8 core system, where all CPUs are present the resulting interrupt affinity for the 4 queues is: irq 39, cpu list 0,1 irq 40, cpu list 2,3 irq 41, cpu list 4,5 irq 42, cpu list 6,7 This is independent of the number of CPUs which are online at the point of initialization because in such a system the offline CPUs can be easily onlined afterwards, while in non-present CPUs need to be plugged physically or virtually which requires external interaction. The downside of this approach is that in case of physical hotplug the interrupt vector spreading might be suboptimal when CPUs 4-7 are physically plugged. Suboptimal from a NUMA point of view and due to the single target nature of interrupt affinities the later plugged CPUs might not be targeted by interrupts at all. Though, physical hotplug systems are not the common case while the broken ACPI table disease is wide spread. So it's preferred to have as many interrupts as possible utilized at the point where the device is initialized. Block multi-queue devices like NVME create a hardware queue per possible CPU, so the goal of commit 84676c1f21 to assign one interrupt vector per possible CPU is still achieved even with physical/virtual hotplug. [ tglx: Changed from online to present CPUs for the first spreading stage, renamed variables for readability sake, added comments and massaged changelog ] Reported-by: Laurence Oberman <loberman@redhat.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Jens Axboe <axboe@kernel.dk> Cc: linux-block@vger.kernel.org Cc: Christoph Hellwig <hch@infradead.org> Link: https://lkml.kernel.org/r/20180308105358.1506-5-ming.lei@redhat.com
2018-03-08 18:53:58 +08:00
if (usedvecs >= affvecs)
curvec = affd->pre_vectors + affvecs;
else
curvec = affd->pre_vectors + usedvecs;
for (; curvec < nvecs; curvec++)
genirq/core: Introduce struct irq_affinity_desc The interrupt affinity management uses straight cpumask pointers to convey the automatically assigned affinity masks for managed interrupts. The core interrupt descriptor allocation also decides based on the pointer being non NULL whether an interrupt is managed or not. Devices which use managed interrupts usually have two classes of interrupts: - Interrupts for multiple device queues - Interrupts for general device management Currently both classes are treated the same way, i.e. as managed interrupts. The general interrupts get the default affinity mask assigned while the device queue interrupts are spread out over the possible CPUs. Treating the general interrupts as managed is both a limitation and under certain circumstances a bug. Assume the following situation: default_irq_affinity = 4..7 So if CPUs 4-7 are offlined, then the core code will shut down the device management interrupts because the last CPU in their affinity mask went offline. It's also a limitation because it's desired to allow manual placement of the general device interrupts for various reasons. If they are marked managed then the interrupt affinity setting from both user and kernel space is disabled. To remedy that situation it's required to convey more information than the cpumasks through various interfaces related to interrupt descriptor allocation. Instead of adding yet another argument, create a new data structure 'irq_affinity_desc' which for now just contains the cpumask. This struct can be expanded to convey auxilliary information in the next step. No functional change, just preparatory work. [ tglx: Simplified logic and clarified changelog ] Suggested-by: Thomas Gleixner <tglx@linutronix.de> Suggested-by: Bjorn Helgaas <bhelgaas@google.com> Signed-off-by: Dou Liyang <douliyangs@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: linux-pci@vger.kernel.org Cc: kashyap.desai@broadcom.com Cc: shivasharan.srikanteshwara@broadcom.com Cc: sumit.saxena@broadcom.com Cc: ming.lei@redhat.com Cc: hch@lst.de Cc: douliyang1@huawei.com Link: https://lkml.kernel.org/r/20181204155122.6327-2-douliyangs@gmail.com
2018-12-04 23:51:20 +08:00
cpumask_copy(&masks[curvec].mask, irq_default_affinity);
genirq/affinity: Spread irq vectors among present CPUs as far as possible Commit 84676c1f21 ("genirq/affinity: assign vectors to all possible CPUs") tried to spread the interrupts accross all possible CPUs to make sure that in case of phsyical hotplug (e.g. virtualization) the CPUs which get plugged in after the device was initialized are targeted by a hardware queue and the corresponding interrupt. This has a downside in cases where the ACPI tables claim that there are more possible CPUs than present CPUs and the number of interrupts to spread out is smaller than the number of possible CPUs. These bogus ACPI tables are unfortunately not uncommon. In such a case the vector spreading algorithm assigns interrupts to CPUs which can never be utilized and as a consequence these interrupts are unused instead of being mapped to present CPUs. As a result the performance of the device is suboptimal. To fix this spread the interrupt vectors in two stages: 1) Spread as many interrupts as possible among the present CPUs 2) Spread the remaining vectors among non present CPUs On a 8 core system, where CPU 0-3 are present and CPU 4-7 are not present, for a device with 4 queues the resulting interrupt affinity is: 1) Before 84676c1f21 ("genirq/affinity: assign vectors to all possible CPUs") irq 39, cpu list 0 irq 40, cpu list 1 irq 41, cpu list 2 irq 42, cpu list 3 2) With 84676c1f21 ("genirq/affinity: assign vectors to all possible CPUs") irq 39, cpu list 0-2 irq 40, cpu list 3-4,6 irq 41, cpu list 5 irq 42, cpu list 7 3) With the refined vector spread applied: irq 39, cpu list 0,4 irq 40, cpu list 1,6 irq 41, cpu list 2,5 irq 42, cpu list 3,7 On a 8 core system, where all CPUs are present the resulting interrupt affinity for the 4 queues is: irq 39, cpu list 0,1 irq 40, cpu list 2,3 irq 41, cpu list 4,5 irq 42, cpu list 6,7 This is independent of the number of CPUs which are online at the point of initialization because in such a system the offline CPUs can be easily onlined afterwards, while in non-present CPUs need to be plugged physically or virtually which requires external interaction. The downside of this approach is that in case of physical hotplug the interrupt vector spreading might be suboptimal when CPUs 4-7 are physically plugged. Suboptimal from a NUMA point of view and due to the single target nature of interrupt affinities the later plugged CPUs might not be targeted by interrupts at all. Though, physical hotplug systems are not the common case while the broken ACPI table disease is wide spread. So it's preferred to have as many interrupts as possible utilized at the point where the device is initialized. Block multi-queue devices like NVME create a hardware queue per possible CPU, so the goal of commit 84676c1f21 to assign one interrupt vector per possible CPU is still achieved even with physical/virtual hotplug. [ tglx: Changed from online to present CPUs for the first spreading stage, renamed variables for readability sake, added comments and massaged changelog ] Reported-by: Laurence Oberman <loberman@redhat.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Jens Axboe <axboe@kernel.dk> Cc: linux-block@vger.kernel.org Cc: Christoph Hellwig <hch@infradead.org> Link: https://lkml.kernel.org/r/20180308105358.1506-5-ming.lei@redhat.com
2018-03-08 18:53:58 +08:00
genirq/affinity: Add is_managed to struct irq_affinity_desc Devices which use managed interrupts usually have two classes of interrupts: - Interrupts for multiple device queues - Interrupts for general device management Currently both classes are treated the same way, i.e. as managed interrupts. The general interrupts get the default affinity mask assigned while the device queue interrupts are spread out over the possible CPUs. Treating the general interrupts as managed is both a limitation and under certain circumstances a bug. Assume the following situation: default_irq_affinity = 4..7 So if CPUs 4-7 are offlined, then the core code will shut down the device management interrupts because the last CPU in their affinity mask went offline. It's also a limitation because it's desired to allow manual placement of the general device interrupts for various reasons. If they are marked managed then the interrupt affinity setting from both user and kernel space is disabled. That limitation was reported by Kashyap and Sumit. Expand struct irq_affinity_desc with a new bit 'is_managed' which is set for truly managed interrupts (queue interrupts) and cleared for the general device interrupts. [ tglx: Simplify code and massage changelog ] Reported-by: Kashyap Desai <kashyap.desai@broadcom.com> Reported-by: Sumit Saxena <sumit.saxena@broadcom.com> Signed-off-by: Dou Liyang <douliyangs@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: linux-pci@vger.kernel.org Cc: shivasharan.srikanteshwara@broadcom.com Cc: ming.lei@redhat.com Cc: hch@lst.de Cc: bhelgaas@google.com Cc: douliyang1@huawei.com Link: https://lkml.kernel.org/r/20181204155122.6327-3-douliyangs@gmail.com
2018-12-04 23:51:21 +08:00
/* Mark the managed interrupts */
for (i = affd->pre_vectors; i < nvecs - affd->post_vectors; i++)
masks[i].is_managed = 1;
return masks;
}
/**
* irq_calc_affinity_vectors - Calculate the optimal number of vectors
* @minvec: The minimum number of vectors available
* @maxvec: The maximum number of vectors available
* @affd: Description of the affinity requirements
*/
unsigned int irq_calc_affinity_vectors(unsigned int minvec, unsigned int maxvec,
const struct irq_affinity *affd)
{
unsigned int resv = affd->pre_vectors + affd->post_vectors;
unsigned int set_vecs;
if (resv > minvec)
return 0;
genirq/affinity: Add new callback for (re)calculating interrupt sets The interrupt affinity spreading mechanism supports to spread out affinities for one or more interrupt sets. A interrupt set contains one or more interrupts. Each set is mapped to a specific functionality of a device, e.g. general I/O queues and read I/O queus of multiqueue block devices. The number of interrupts per set is defined by the driver. It depends on the total number of available interrupts for the device, which is determined by the PCI capabilites and the availability of underlying CPU resources, and the number of queues which the device provides and the driver wants to instantiate. The driver passes initial configuration for the interrupt allocation via a pointer to struct irq_affinity. Right now the allocation mechanism is complex as it requires to have a loop in the driver to determine the maximum number of interrupts which are provided by the PCI capabilities and the underlying CPU resources. This loop would have to be replicated in every driver which wants to utilize this mechanism. That's unwanted code duplication and error prone. In order to move this into generic facilities it is required to have a mechanism, which allows the recalculation of the interrupt sets and their size, in the core code. As the core code does not have any knowledge about the underlying device, a driver specific callback is required in struct irq_affinity, which can be invoked by the core code. The callback gets the number of available interupts as an argument, so the driver can calculate the corresponding number and size of interrupt sets. At the moment the struct irq_affinity pointer which is handed in from the driver and passed through to several core functions is marked 'const', but for the callback to be able to modify the data in the struct it's required to remove the 'const' qualifier. Add the optional callback to struct irq_affinity, which allows drivers to recalculate the number and size of interrupt sets and remove the 'const' qualifier. For simple invocations, which do not supply a callback, a default callback is installed, which just sets nr_sets to 1 and transfers the number of spreadable vectors to the set_size array at index 0. This is for now guarded by a check for nr_sets != 0 to keep the NVME driver working until it is converted to the callback mechanism. To make sure that the driver configuration is correct under all circumstances the callback is invoked even when there are no interrupts for queues left, i.e. the pre/post requirements already exhaust the numner of available interrupts. At the PCI layer irq_create_affinity_masks() has to be invoked even for the case where the legacy interrupt is used. That ensures that the callback is invoked and the device driver can adjust to that situation. [ tglx: Fixed the simple case (no sets required). Moved the sanity check for nr_sets after the invocation of the callback so it catches broken drivers. Fixed the kernel doc comments for struct irq_affinity and de-'This patch'-ed the changelog ] Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Marc Zyngier <marc.zyngier@arm.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Bjorn Helgaas <helgaas@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: linux-block@vger.kernel.org Cc: Sagi Grimberg <sagi@grimberg.me> Cc: linux-nvme@lists.infradead.org Cc: linux-pci@vger.kernel.org Cc: Keith Busch <keith.busch@intel.com> Cc: Sumit Saxena <sumit.saxena@broadcom.com> Cc: Kashyap Desai <kashyap.desai@broadcom.com> Cc: Shivasharan Srikanteshwara <shivasharan.srikanteshwara@broadcom.com> Link: https://lkml.kernel.org/r/20190216172228.512444498@linutronix.de
2019-02-17 01:13:09 +08:00
if (affd->calc_sets) {
set_vecs = maxvec - resv;
} else {
cpus_read_lock();
set_vecs = cpumask_weight(cpu_possible_mask);
cpus_read_unlock();
}
return resv + min(set_vecs, maxvec - resv);
}