linux/kernel/irq/matrix.c

512 lines
13 KiB
C
Raw Normal View History

// SPDX-License-Identifier: GPL-2.0
// Copyright (C) 2017 Thomas Gleixner <tglx@linutronix.de>
genirq: Implement bitmap matrix allocator Implement the infrastructure for a simple bitmap based allocator, which will replace the x86 vector allocator. It's in the core code as other architectures might be able to reuse/extend it. For now it only implements allocations for single CPUs, but it's simple to add multi CPU allocation support if required. The concept is rather simple: Global information: system_vector bitmap global accounting PerCPU information: allocation bitmap managed allocation bitmap local accounting The system vector bitmap is used to exclude vectors system wide from the allocation space. The allocation bitmap is used to keep track of per cpu used vectors. The managed allocation bitmap is used to reserve vectors for managed interrupts. When a regular (non managed) interrupt allocation happens then the following rule applies: tmpmap = system_map | alloc_map | managed_map find_zero_bit(tmpmap) Oring the bitmaps together gives the real available space. The same rule applies for reserving a managed interrupt vector. But contrary to the regular interrupts the reservation only marks the bit in the managed map and therefor excludes it from the regular allocations. The managed map is only cleaned out when the a managed interrupt is completely released and it stays alive accross CPU offline/online operations. For managed interrupt allocations the rule is: tmpmap = managed_map & ~alloc_map find_first_bit(tmpmap) This returns the first bit which is in the managed map, but not yet allocated in the allocation map. The allocation marks it in the allocation map and hands it back to the caller for use. The rest of the code are helper functions to handle the various requirements and the accounting which are necessary to replace the x86 vector allocation code. The result is a single patch as the evolution of this infrastructure cannot be represented in bits and pieces. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Juergen Gross <jgross@suse.com> Tested-by: Yu Chen <yu.c.chen@intel.com> Acked-by: Juergen Gross <jgross@suse.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Marc Zyngier <marc.zyngier@arm.com> Cc: Alok Kataria <akataria@vmware.com> Cc: Joerg Roedel <joro@8bytes.org> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Rui Zhang <rui.zhang@intel.com> Cc: "K. Y. Srinivasan" <kys@microsoft.com> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Chris Metcalf <cmetcalf@mellanox.com> Cc: Len Brown <lenb@kernel.org> Link: https://lkml.kernel.org/r/20170913213153.185437174@linutronix.de
2017-09-14 05:29:14 +08:00
#include <linux/spinlock.h>
#include <linux/seq_file.h>
#include <linux/bitmap.h>
#include <linux/percpu.h>
#include <linux/cpu.h>
#include <linux/irq.h>
#define IRQ_MATRIX_SIZE (BITS_TO_LONGS(IRQ_MATRIX_BITS))
genirq: Implement bitmap matrix allocator Implement the infrastructure for a simple bitmap based allocator, which will replace the x86 vector allocator. It's in the core code as other architectures might be able to reuse/extend it. For now it only implements allocations for single CPUs, but it's simple to add multi CPU allocation support if required. The concept is rather simple: Global information: system_vector bitmap global accounting PerCPU information: allocation bitmap managed allocation bitmap local accounting The system vector bitmap is used to exclude vectors system wide from the allocation space. The allocation bitmap is used to keep track of per cpu used vectors. The managed allocation bitmap is used to reserve vectors for managed interrupts. When a regular (non managed) interrupt allocation happens then the following rule applies: tmpmap = system_map | alloc_map | managed_map find_zero_bit(tmpmap) Oring the bitmaps together gives the real available space. The same rule applies for reserving a managed interrupt vector. But contrary to the regular interrupts the reservation only marks the bit in the managed map and therefor excludes it from the regular allocations. The managed map is only cleaned out when the a managed interrupt is completely released and it stays alive accross CPU offline/online operations. For managed interrupt allocations the rule is: tmpmap = managed_map & ~alloc_map find_first_bit(tmpmap) This returns the first bit which is in the managed map, but not yet allocated in the allocation map. The allocation marks it in the allocation map and hands it back to the caller for use. The rest of the code are helper functions to handle the various requirements and the accounting which are necessary to replace the x86 vector allocation code. The result is a single patch as the evolution of this infrastructure cannot be represented in bits and pieces. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Juergen Gross <jgross@suse.com> Tested-by: Yu Chen <yu.c.chen@intel.com> Acked-by: Juergen Gross <jgross@suse.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Marc Zyngier <marc.zyngier@arm.com> Cc: Alok Kataria <akataria@vmware.com> Cc: Joerg Roedel <joro@8bytes.org> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Rui Zhang <rui.zhang@intel.com> Cc: "K. Y. Srinivasan" <kys@microsoft.com> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Chris Metcalf <cmetcalf@mellanox.com> Cc: Len Brown <lenb@kernel.org> Link: https://lkml.kernel.org/r/20170913213153.185437174@linutronix.de
2017-09-14 05:29:14 +08:00
struct cpumap {
unsigned int available;
unsigned int allocated;
unsigned int managed;
genirq/matrix: Improve target CPU selection for managed interrupts. On large systems with multiple devices of the same class (e.g. NVMe disks, using managed interrupts), the kernel can affinitize these interrupts to a small subset of CPUs instead of spreading them out evenly. irq_matrix_alloc_managed() tries to select the CPU in the supplied cpumask of possible target CPUs which has the lowest number of interrupt vectors allocated. This is done by searching the CPU with the highest number of available vectors. While this is correct for non-managed CPUs it can select the wrong CPU for managed interrupts. Under certain constellations this results in affinitizing the managed interrupts of several devices to a single CPU in a set. The book keeping of available vectors works the following way: 1) Non-managed interrupts: available is decremented when the interrupt is actually requested by the device driver and a vector is assigned. It's incremented when the interrupt and the vector are freed. 2) Managed interrupts: Managed interrupts guarantee vector reservation when the MSI/MSI-X functionality of a device is enabled, which is achieved by reserving vectors in the bitmaps of the possible target CPUs. This reservation decrements the available count on each possible target CPU. When the interrupt is requested by the device driver then a vector is allocated from the reserved region. The operation is reversed when the interrupt is freed by the device driver. Neither of these operations affect the available count. The reservation persist up to the point where the MSI/MSI-X functionality is disabled and only this operation increments the available count again. For non-managed interrupts the available count is the correct selection criterion because the guaranteed reservations need to be taken into account. Using the allocated counter could lead to a failing allocation in the following situation (total vector space of 10 assumed): CPU0 CPU1 available: 2 0 allocated: 5 3 <--- CPU1 is selected, but available space = 0 managed reserved: 3 7 while available yields the correct result. For managed interrupts the available count is not the appropriate selection criterion because as explained above the available count is not affected by the actual vector allocation. The following example illustrates that. Total vector space of 10 assumed. The starting point is: CPU0 CPU1 available: 5 4 allocated: 2 3 managed reserved: 3 3 Allocating vectors for three non-managed interrupts will result in affinitizing the first two to CPU0 and the third one to CPU1 because the available count is adjusted with each allocation: CPU0 CPU1 available: 5 4 <- Select CPU0 for 1st allocation --> allocated: 3 3 available: 4 4 <- Select CPU0 for 2nd allocation --> allocated: 4 3 available: 3 4 <- Select CPU1 for 3rd allocation --> allocated: 4 4 But the allocation of three managed interrupts starting from the same point will affinitize all of them to CPU0 because the available count is not affected by the allocation (see above). So the end result is: CPU0 CPU1 available: 5 4 allocated: 5 3 Introduce a "managed_allocated" field in struct cpumap to track the vector allocation for managed interrupts separately. Use this information to select the target CPU when a vector is allocated for a managed interrupt, which results in more evenly distributed vector assignments. The above example results in the following allocations: CPU0 CPU1 managed_allocated: 0 0 <- Select CPU0 for 1st allocation --> allocated: 3 3 managed_allocated: 1 0 <- Select CPU1 for 2nd allocation --> allocated: 3 4 managed_allocated: 1 1 <- Select CPU0 for 3rd allocation --> allocated: 4 4 The allocation of non-managed interrupts is not affected by this change and is still evaluating the available count. The overall distribution of interrupt vectors for both types of interrupts might still not be perfectly even depending on the number of non-managed and managed interrupts in a system, but due to the reservation guarantee for managed interrupts this cannot be avoided. Expose the new field in debugfs as well. [ tglx: Clarified the background of the problem in the changelog and described it independent of NVME ] Signed-off-by: Long Li <longli@microsoft.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Michael Kelley <mikelley@microsoft.com> Link: https://lkml.kernel.org/r/20181106040000.27316-1-longli@linuxonhyperv.com
2018-11-06 12:00:00 +08:00
unsigned int managed_allocated;
bool initialized;
genirq: Implement bitmap matrix allocator Implement the infrastructure for a simple bitmap based allocator, which will replace the x86 vector allocator. It's in the core code as other architectures might be able to reuse/extend it. For now it only implements allocations for single CPUs, but it's simple to add multi CPU allocation support if required. The concept is rather simple: Global information: system_vector bitmap global accounting PerCPU information: allocation bitmap managed allocation bitmap local accounting The system vector bitmap is used to exclude vectors system wide from the allocation space. The allocation bitmap is used to keep track of per cpu used vectors. The managed allocation bitmap is used to reserve vectors for managed interrupts. When a regular (non managed) interrupt allocation happens then the following rule applies: tmpmap = system_map | alloc_map | managed_map find_zero_bit(tmpmap) Oring the bitmaps together gives the real available space. The same rule applies for reserving a managed interrupt vector. But contrary to the regular interrupts the reservation only marks the bit in the managed map and therefor excludes it from the regular allocations. The managed map is only cleaned out when the a managed interrupt is completely released and it stays alive accross CPU offline/online operations. For managed interrupt allocations the rule is: tmpmap = managed_map & ~alloc_map find_first_bit(tmpmap) This returns the first bit which is in the managed map, but not yet allocated in the allocation map. The allocation marks it in the allocation map and hands it back to the caller for use. The rest of the code are helper functions to handle the various requirements and the accounting which are necessary to replace the x86 vector allocation code. The result is a single patch as the evolution of this infrastructure cannot be represented in bits and pieces. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Juergen Gross <jgross@suse.com> Tested-by: Yu Chen <yu.c.chen@intel.com> Acked-by: Juergen Gross <jgross@suse.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Marc Zyngier <marc.zyngier@arm.com> Cc: Alok Kataria <akataria@vmware.com> Cc: Joerg Roedel <joro@8bytes.org> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Rui Zhang <rui.zhang@intel.com> Cc: "K. Y. Srinivasan" <kys@microsoft.com> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Chris Metcalf <cmetcalf@mellanox.com> Cc: Len Brown <lenb@kernel.org> Link: https://lkml.kernel.org/r/20170913213153.185437174@linutronix.de
2017-09-14 05:29:14 +08:00
bool online;
unsigned long alloc_map[IRQ_MATRIX_SIZE];
unsigned long managed_map[IRQ_MATRIX_SIZE];
};
struct irq_matrix {
unsigned int matrix_bits;
unsigned int alloc_start;
unsigned int alloc_end;
unsigned int alloc_size;
unsigned int global_available;
unsigned int global_reserved;
unsigned int systembits_inalloc;
unsigned int total_allocated;
unsigned int online_maps;
struct cpumap __percpu *maps;
unsigned long scratch_map[IRQ_MATRIX_SIZE];
unsigned long system_map[IRQ_MATRIX_SIZE];
};
#define CREATE_TRACE_POINTS
#include <trace/events/irq_matrix.h>
genirq: Implement bitmap matrix allocator Implement the infrastructure for a simple bitmap based allocator, which will replace the x86 vector allocator. It's in the core code as other architectures might be able to reuse/extend it. For now it only implements allocations for single CPUs, but it's simple to add multi CPU allocation support if required. The concept is rather simple: Global information: system_vector bitmap global accounting PerCPU information: allocation bitmap managed allocation bitmap local accounting The system vector bitmap is used to exclude vectors system wide from the allocation space. The allocation bitmap is used to keep track of per cpu used vectors. The managed allocation bitmap is used to reserve vectors for managed interrupts. When a regular (non managed) interrupt allocation happens then the following rule applies: tmpmap = system_map | alloc_map | managed_map find_zero_bit(tmpmap) Oring the bitmaps together gives the real available space. The same rule applies for reserving a managed interrupt vector. But contrary to the regular interrupts the reservation only marks the bit in the managed map and therefor excludes it from the regular allocations. The managed map is only cleaned out when the a managed interrupt is completely released and it stays alive accross CPU offline/online operations. For managed interrupt allocations the rule is: tmpmap = managed_map & ~alloc_map find_first_bit(tmpmap) This returns the first bit which is in the managed map, but not yet allocated in the allocation map. The allocation marks it in the allocation map and hands it back to the caller for use. The rest of the code are helper functions to handle the various requirements and the accounting which are necessary to replace the x86 vector allocation code. The result is a single patch as the evolution of this infrastructure cannot be represented in bits and pieces. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Juergen Gross <jgross@suse.com> Tested-by: Yu Chen <yu.c.chen@intel.com> Acked-by: Juergen Gross <jgross@suse.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Marc Zyngier <marc.zyngier@arm.com> Cc: Alok Kataria <akataria@vmware.com> Cc: Joerg Roedel <joro@8bytes.org> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Rui Zhang <rui.zhang@intel.com> Cc: "K. Y. Srinivasan" <kys@microsoft.com> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Chris Metcalf <cmetcalf@mellanox.com> Cc: Len Brown <lenb@kernel.org> Link: https://lkml.kernel.org/r/20170913213153.185437174@linutronix.de
2017-09-14 05:29:14 +08:00
/**
* irq_alloc_matrix - Allocate a irq_matrix structure and initialize it
* @matrix_bits: Number of matrix bits must be <= IRQ_MATRIX_BITS
* @alloc_start: From which bit the allocation search starts
* @alloc_end: At which bit the allocation search ends, i.e first
* invalid bit
*/
__init struct irq_matrix *irq_alloc_matrix(unsigned int matrix_bits,
unsigned int alloc_start,
unsigned int alloc_end)
{
struct irq_matrix *m;
if (matrix_bits > IRQ_MATRIX_BITS)
return NULL;
m = kzalloc(sizeof(*m), GFP_KERNEL);
if (!m)
return NULL;
m->matrix_bits = matrix_bits;
m->alloc_start = alloc_start;
m->alloc_end = alloc_end;
m->alloc_size = alloc_end - alloc_start;
m->maps = alloc_percpu(*m->maps);
if (!m->maps) {
kfree(m);
return NULL;
}
return m;
}
/**
* irq_matrix_online - Bring the local CPU matrix online
* @m: Matrix pointer
*/
void irq_matrix_online(struct irq_matrix *m)
{
struct cpumap *cm = this_cpu_ptr(m->maps);
BUG_ON(cm->online);
if (!cm->initialized) {
cm->available = m->alloc_size;
cm->available -= cm->managed + m->systembits_inalloc;
cm->initialized = true;
}
genirq: Implement bitmap matrix allocator Implement the infrastructure for a simple bitmap based allocator, which will replace the x86 vector allocator. It's in the core code as other architectures might be able to reuse/extend it. For now it only implements allocations for single CPUs, but it's simple to add multi CPU allocation support if required. The concept is rather simple: Global information: system_vector bitmap global accounting PerCPU information: allocation bitmap managed allocation bitmap local accounting The system vector bitmap is used to exclude vectors system wide from the allocation space. The allocation bitmap is used to keep track of per cpu used vectors. The managed allocation bitmap is used to reserve vectors for managed interrupts. When a regular (non managed) interrupt allocation happens then the following rule applies: tmpmap = system_map | alloc_map | managed_map find_zero_bit(tmpmap) Oring the bitmaps together gives the real available space. The same rule applies for reserving a managed interrupt vector. But contrary to the regular interrupts the reservation only marks the bit in the managed map and therefor excludes it from the regular allocations. The managed map is only cleaned out when the a managed interrupt is completely released and it stays alive accross CPU offline/online operations. For managed interrupt allocations the rule is: tmpmap = managed_map & ~alloc_map find_first_bit(tmpmap) This returns the first bit which is in the managed map, but not yet allocated in the allocation map. The allocation marks it in the allocation map and hands it back to the caller for use. The rest of the code are helper functions to handle the various requirements and the accounting which are necessary to replace the x86 vector allocation code. The result is a single patch as the evolution of this infrastructure cannot be represented in bits and pieces. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Juergen Gross <jgross@suse.com> Tested-by: Yu Chen <yu.c.chen@intel.com> Acked-by: Juergen Gross <jgross@suse.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Marc Zyngier <marc.zyngier@arm.com> Cc: Alok Kataria <akataria@vmware.com> Cc: Joerg Roedel <joro@8bytes.org> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Rui Zhang <rui.zhang@intel.com> Cc: "K. Y. Srinivasan" <kys@microsoft.com> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Chris Metcalf <cmetcalf@mellanox.com> Cc: Len Brown <lenb@kernel.org> Link: https://lkml.kernel.org/r/20170913213153.185437174@linutronix.de
2017-09-14 05:29:14 +08:00
m->global_available += cm->available;
cm->online = true;
m->online_maps++;
trace_irq_matrix_online(m);
genirq: Implement bitmap matrix allocator Implement the infrastructure for a simple bitmap based allocator, which will replace the x86 vector allocator. It's in the core code as other architectures might be able to reuse/extend it. For now it only implements allocations for single CPUs, but it's simple to add multi CPU allocation support if required. The concept is rather simple: Global information: system_vector bitmap global accounting PerCPU information: allocation bitmap managed allocation bitmap local accounting The system vector bitmap is used to exclude vectors system wide from the allocation space. The allocation bitmap is used to keep track of per cpu used vectors. The managed allocation bitmap is used to reserve vectors for managed interrupts. When a regular (non managed) interrupt allocation happens then the following rule applies: tmpmap = system_map | alloc_map | managed_map find_zero_bit(tmpmap) Oring the bitmaps together gives the real available space. The same rule applies for reserving a managed interrupt vector. But contrary to the regular interrupts the reservation only marks the bit in the managed map and therefor excludes it from the regular allocations. The managed map is only cleaned out when the a managed interrupt is completely released and it stays alive accross CPU offline/online operations. For managed interrupt allocations the rule is: tmpmap = managed_map & ~alloc_map find_first_bit(tmpmap) This returns the first bit which is in the managed map, but not yet allocated in the allocation map. The allocation marks it in the allocation map and hands it back to the caller for use. The rest of the code are helper functions to handle the various requirements and the accounting which are necessary to replace the x86 vector allocation code. The result is a single patch as the evolution of this infrastructure cannot be represented in bits and pieces. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Juergen Gross <jgross@suse.com> Tested-by: Yu Chen <yu.c.chen@intel.com> Acked-by: Juergen Gross <jgross@suse.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Marc Zyngier <marc.zyngier@arm.com> Cc: Alok Kataria <akataria@vmware.com> Cc: Joerg Roedel <joro@8bytes.org> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Rui Zhang <rui.zhang@intel.com> Cc: "K. Y. Srinivasan" <kys@microsoft.com> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Chris Metcalf <cmetcalf@mellanox.com> Cc: Len Brown <lenb@kernel.org> Link: https://lkml.kernel.org/r/20170913213153.185437174@linutronix.de
2017-09-14 05:29:14 +08:00
}
/**
* irq_matrix_offline - Bring the local CPU matrix offline
* @m: Matrix pointer
*/
void irq_matrix_offline(struct irq_matrix *m)
{
struct cpumap *cm = this_cpu_ptr(m->maps);
/* Update the global available size */
m->global_available -= cm->available;
cm->online = false;
m->online_maps--;
trace_irq_matrix_offline(m);
genirq: Implement bitmap matrix allocator Implement the infrastructure for a simple bitmap based allocator, which will replace the x86 vector allocator. It's in the core code as other architectures might be able to reuse/extend it. For now it only implements allocations for single CPUs, but it's simple to add multi CPU allocation support if required. The concept is rather simple: Global information: system_vector bitmap global accounting PerCPU information: allocation bitmap managed allocation bitmap local accounting The system vector bitmap is used to exclude vectors system wide from the allocation space. The allocation bitmap is used to keep track of per cpu used vectors. The managed allocation bitmap is used to reserve vectors for managed interrupts. When a regular (non managed) interrupt allocation happens then the following rule applies: tmpmap = system_map | alloc_map | managed_map find_zero_bit(tmpmap) Oring the bitmaps together gives the real available space. The same rule applies for reserving a managed interrupt vector. But contrary to the regular interrupts the reservation only marks the bit in the managed map and therefor excludes it from the regular allocations. The managed map is only cleaned out when the a managed interrupt is completely released and it stays alive accross CPU offline/online operations. For managed interrupt allocations the rule is: tmpmap = managed_map & ~alloc_map find_first_bit(tmpmap) This returns the first bit which is in the managed map, but not yet allocated in the allocation map. The allocation marks it in the allocation map and hands it back to the caller for use. The rest of the code are helper functions to handle the various requirements and the accounting which are necessary to replace the x86 vector allocation code. The result is a single patch as the evolution of this infrastructure cannot be represented in bits and pieces. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Juergen Gross <jgross@suse.com> Tested-by: Yu Chen <yu.c.chen@intel.com> Acked-by: Juergen Gross <jgross@suse.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Marc Zyngier <marc.zyngier@arm.com> Cc: Alok Kataria <akataria@vmware.com> Cc: Joerg Roedel <joro@8bytes.org> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Rui Zhang <rui.zhang@intel.com> Cc: "K. Y. Srinivasan" <kys@microsoft.com> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Chris Metcalf <cmetcalf@mellanox.com> Cc: Len Brown <lenb@kernel.org> Link: https://lkml.kernel.org/r/20170913213153.185437174@linutronix.de
2017-09-14 05:29:14 +08:00
}
static unsigned int matrix_alloc_area(struct irq_matrix *m, struct cpumap *cm,
unsigned int num, bool managed)
{
unsigned int area, start = m->alloc_start;
unsigned int end = m->alloc_end;
bitmap_or(m->scratch_map, cm->managed_map, m->system_map, end);
bitmap_or(m->scratch_map, m->scratch_map, cm->alloc_map, end);
area = bitmap_find_next_zero_area(m->scratch_map, end, start, num, 0);
if (area >= end)
return area;
if (managed)
bitmap_set(cm->managed_map, area, num);
else
bitmap_set(cm->alloc_map, area, num);
return area;
}
/* Find the best CPU which has the lowest vector allocation count */
static unsigned int matrix_find_best_cpu(struct irq_matrix *m,
const struct cpumask *msk)
{
unsigned int cpu, best_cpu, maxavl = 0;
struct cpumap *cm;
best_cpu = UINT_MAX;
for_each_cpu(cpu, msk) {
cm = per_cpu_ptr(m->maps, cpu);
if (!cm->online || cm->available <= maxavl)
continue;
best_cpu = cpu;
maxavl = cm->available;
}
return best_cpu;
}
genirq/matrix: Improve target CPU selection for managed interrupts. On large systems with multiple devices of the same class (e.g. NVMe disks, using managed interrupts), the kernel can affinitize these interrupts to a small subset of CPUs instead of spreading them out evenly. irq_matrix_alloc_managed() tries to select the CPU in the supplied cpumask of possible target CPUs which has the lowest number of interrupt vectors allocated. This is done by searching the CPU with the highest number of available vectors. While this is correct for non-managed CPUs it can select the wrong CPU for managed interrupts. Under certain constellations this results in affinitizing the managed interrupts of several devices to a single CPU in a set. The book keeping of available vectors works the following way: 1) Non-managed interrupts: available is decremented when the interrupt is actually requested by the device driver and a vector is assigned. It's incremented when the interrupt and the vector are freed. 2) Managed interrupts: Managed interrupts guarantee vector reservation when the MSI/MSI-X functionality of a device is enabled, which is achieved by reserving vectors in the bitmaps of the possible target CPUs. This reservation decrements the available count on each possible target CPU. When the interrupt is requested by the device driver then a vector is allocated from the reserved region. The operation is reversed when the interrupt is freed by the device driver. Neither of these operations affect the available count. The reservation persist up to the point where the MSI/MSI-X functionality is disabled and only this operation increments the available count again. For non-managed interrupts the available count is the correct selection criterion because the guaranteed reservations need to be taken into account. Using the allocated counter could lead to a failing allocation in the following situation (total vector space of 10 assumed): CPU0 CPU1 available: 2 0 allocated: 5 3 <--- CPU1 is selected, but available space = 0 managed reserved: 3 7 while available yields the correct result. For managed interrupts the available count is not the appropriate selection criterion because as explained above the available count is not affected by the actual vector allocation. The following example illustrates that. Total vector space of 10 assumed. The starting point is: CPU0 CPU1 available: 5 4 allocated: 2 3 managed reserved: 3 3 Allocating vectors for three non-managed interrupts will result in affinitizing the first two to CPU0 and the third one to CPU1 because the available count is adjusted with each allocation: CPU0 CPU1 available: 5 4 <- Select CPU0 for 1st allocation --> allocated: 3 3 available: 4 4 <- Select CPU0 for 2nd allocation --> allocated: 4 3 available: 3 4 <- Select CPU1 for 3rd allocation --> allocated: 4 4 But the allocation of three managed interrupts starting from the same point will affinitize all of them to CPU0 because the available count is not affected by the allocation (see above). So the end result is: CPU0 CPU1 available: 5 4 allocated: 5 3 Introduce a "managed_allocated" field in struct cpumap to track the vector allocation for managed interrupts separately. Use this information to select the target CPU when a vector is allocated for a managed interrupt, which results in more evenly distributed vector assignments. The above example results in the following allocations: CPU0 CPU1 managed_allocated: 0 0 <- Select CPU0 for 1st allocation --> allocated: 3 3 managed_allocated: 1 0 <- Select CPU1 for 2nd allocation --> allocated: 3 4 managed_allocated: 1 1 <- Select CPU0 for 3rd allocation --> allocated: 4 4 The allocation of non-managed interrupts is not affected by this change and is still evaluating the available count. The overall distribution of interrupt vectors for both types of interrupts might still not be perfectly even depending on the number of non-managed and managed interrupts in a system, but due to the reservation guarantee for managed interrupts this cannot be avoided. Expose the new field in debugfs as well. [ tglx: Clarified the background of the problem in the changelog and described it independent of NVME ] Signed-off-by: Long Li <longli@microsoft.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Michael Kelley <mikelley@microsoft.com> Link: https://lkml.kernel.org/r/20181106040000.27316-1-longli@linuxonhyperv.com
2018-11-06 12:00:00 +08:00
/* Find the best CPU which has the lowest number of managed IRQs allocated */
static unsigned int matrix_find_best_cpu_managed(struct irq_matrix *m,
const struct cpumask *msk)
{
unsigned int cpu, best_cpu, allocated = UINT_MAX;
struct cpumap *cm;
best_cpu = UINT_MAX;
for_each_cpu(cpu, msk) {
cm = per_cpu_ptr(m->maps, cpu);
if (!cm->online || cm->managed_allocated > allocated)
continue;
best_cpu = cpu;
allocated = cm->managed_allocated;
}
return best_cpu;
}
genirq: Implement bitmap matrix allocator Implement the infrastructure for a simple bitmap based allocator, which will replace the x86 vector allocator. It's in the core code as other architectures might be able to reuse/extend it. For now it only implements allocations for single CPUs, but it's simple to add multi CPU allocation support if required. The concept is rather simple: Global information: system_vector bitmap global accounting PerCPU information: allocation bitmap managed allocation bitmap local accounting The system vector bitmap is used to exclude vectors system wide from the allocation space. The allocation bitmap is used to keep track of per cpu used vectors. The managed allocation bitmap is used to reserve vectors for managed interrupts. When a regular (non managed) interrupt allocation happens then the following rule applies: tmpmap = system_map | alloc_map | managed_map find_zero_bit(tmpmap) Oring the bitmaps together gives the real available space. The same rule applies for reserving a managed interrupt vector. But contrary to the regular interrupts the reservation only marks the bit in the managed map and therefor excludes it from the regular allocations. The managed map is only cleaned out when the a managed interrupt is completely released and it stays alive accross CPU offline/online operations. For managed interrupt allocations the rule is: tmpmap = managed_map & ~alloc_map find_first_bit(tmpmap) This returns the first bit which is in the managed map, but not yet allocated in the allocation map. The allocation marks it in the allocation map and hands it back to the caller for use. The rest of the code are helper functions to handle the various requirements and the accounting which are necessary to replace the x86 vector allocation code. The result is a single patch as the evolution of this infrastructure cannot be represented in bits and pieces. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Juergen Gross <jgross@suse.com> Tested-by: Yu Chen <yu.c.chen@intel.com> Acked-by: Juergen Gross <jgross@suse.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Marc Zyngier <marc.zyngier@arm.com> Cc: Alok Kataria <akataria@vmware.com> Cc: Joerg Roedel <joro@8bytes.org> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Rui Zhang <rui.zhang@intel.com> Cc: "K. Y. Srinivasan" <kys@microsoft.com> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Chris Metcalf <cmetcalf@mellanox.com> Cc: Len Brown <lenb@kernel.org> Link: https://lkml.kernel.org/r/20170913213153.185437174@linutronix.de
2017-09-14 05:29:14 +08:00
/**
* irq_matrix_assign_system - Assign system wide entry in the matrix
* @m: Matrix pointer
* @bit: Which bit to reserve
* @replace: Replace an already allocated vector with a system
* vector at the same bit position.
*
* The BUG_ON()s below are on purpose. If this goes wrong in the
* early boot process, then the chance to survive is about zero.
* If this happens when the system is life, it's not much better.
*/
void irq_matrix_assign_system(struct irq_matrix *m, unsigned int bit,
bool replace)
{
struct cpumap *cm = this_cpu_ptr(m->maps);
BUG_ON(bit > m->matrix_bits);
BUG_ON(m->online_maps > 1 || (m->online_maps && !replace));
set_bit(bit, m->system_map);
if (replace) {
BUG_ON(!test_and_clear_bit(bit, cm->alloc_map));
cm->allocated--;
m->total_allocated--;
}
if (bit >= m->alloc_start && bit < m->alloc_end)
m->systembits_inalloc++;
trace_irq_matrix_assign_system(bit, m);
genirq: Implement bitmap matrix allocator Implement the infrastructure for a simple bitmap based allocator, which will replace the x86 vector allocator. It's in the core code as other architectures might be able to reuse/extend it. For now it only implements allocations for single CPUs, but it's simple to add multi CPU allocation support if required. The concept is rather simple: Global information: system_vector bitmap global accounting PerCPU information: allocation bitmap managed allocation bitmap local accounting The system vector bitmap is used to exclude vectors system wide from the allocation space. The allocation bitmap is used to keep track of per cpu used vectors. The managed allocation bitmap is used to reserve vectors for managed interrupts. When a regular (non managed) interrupt allocation happens then the following rule applies: tmpmap = system_map | alloc_map | managed_map find_zero_bit(tmpmap) Oring the bitmaps together gives the real available space. The same rule applies for reserving a managed interrupt vector. But contrary to the regular interrupts the reservation only marks the bit in the managed map and therefor excludes it from the regular allocations. The managed map is only cleaned out when the a managed interrupt is completely released and it stays alive accross CPU offline/online operations. For managed interrupt allocations the rule is: tmpmap = managed_map & ~alloc_map find_first_bit(tmpmap) This returns the first bit which is in the managed map, but not yet allocated in the allocation map. The allocation marks it in the allocation map and hands it back to the caller for use. The rest of the code are helper functions to handle the various requirements and the accounting which are necessary to replace the x86 vector allocation code. The result is a single patch as the evolution of this infrastructure cannot be represented in bits and pieces. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Juergen Gross <jgross@suse.com> Tested-by: Yu Chen <yu.c.chen@intel.com> Acked-by: Juergen Gross <jgross@suse.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Marc Zyngier <marc.zyngier@arm.com> Cc: Alok Kataria <akataria@vmware.com> Cc: Joerg Roedel <joro@8bytes.org> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Rui Zhang <rui.zhang@intel.com> Cc: "K. Y. Srinivasan" <kys@microsoft.com> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Chris Metcalf <cmetcalf@mellanox.com> Cc: Len Brown <lenb@kernel.org> Link: https://lkml.kernel.org/r/20170913213153.185437174@linutronix.de
2017-09-14 05:29:14 +08:00
}
/**
* irq_matrix_reserve_managed - Reserve a managed interrupt in a CPU map
* @m: Matrix pointer
* @msk: On which CPUs the bits should be reserved.
*
* Can be called for offline CPUs. Note, this will only reserve one bit
* on all CPUs in @msk, but it's not guaranteed that the bits are at the
* same offset on all CPUs
*/
int irq_matrix_reserve_managed(struct irq_matrix *m, const struct cpumask *msk)
{
unsigned int cpu, failed_cpu;
for_each_cpu(cpu, msk) {
struct cpumap *cm = per_cpu_ptr(m->maps, cpu);
unsigned int bit;
bit = matrix_alloc_area(m, cm, 1, true);
if (bit >= m->alloc_end)
goto cleanup;
cm->managed++;
if (cm->online) {
cm->available--;
m->global_available--;
}
trace_irq_matrix_reserve_managed(bit, cpu, m, cm);
genirq: Implement bitmap matrix allocator Implement the infrastructure for a simple bitmap based allocator, which will replace the x86 vector allocator. It's in the core code as other architectures might be able to reuse/extend it. For now it only implements allocations for single CPUs, but it's simple to add multi CPU allocation support if required. The concept is rather simple: Global information: system_vector bitmap global accounting PerCPU information: allocation bitmap managed allocation bitmap local accounting The system vector bitmap is used to exclude vectors system wide from the allocation space. The allocation bitmap is used to keep track of per cpu used vectors. The managed allocation bitmap is used to reserve vectors for managed interrupts. When a regular (non managed) interrupt allocation happens then the following rule applies: tmpmap = system_map | alloc_map | managed_map find_zero_bit(tmpmap) Oring the bitmaps together gives the real available space. The same rule applies for reserving a managed interrupt vector. But contrary to the regular interrupts the reservation only marks the bit in the managed map and therefor excludes it from the regular allocations. The managed map is only cleaned out when the a managed interrupt is completely released and it stays alive accross CPU offline/online operations. For managed interrupt allocations the rule is: tmpmap = managed_map & ~alloc_map find_first_bit(tmpmap) This returns the first bit which is in the managed map, but not yet allocated in the allocation map. The allocation marks it in the allocation map and hands it back to the caller for use. The rest of the code are helper functions to handle the various requirements and the accounting which are necessary to replace the x86 vector allocation code. The result is a single patch as the evolution of this infrastructure cannot be represented in bits and pieces. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Juergen Gross <jgross@suse.com> Tested-by: Yu Chen <yu.c.chen@intel.com> Acked-by: Juergen Gross <jgross@suse.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Marc Zyngier <marc.zyngier@arm.com> Cc: Alok Kataria <akataria@vmware.com> Cc: Joerg Roedel <joro@8bytes.org> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Rui Zhang <rui.zhang@intel.com> Cc: "K. Y. Srinivasan" <kys@microsoft.com> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Chris Metcalf <cmetcalf@mellanox.com> Cc: Len Brown <lenb@kernel.org> Link: https://lkml.kernel.org/r/20170913213153.185437174@linutronix.de
2017-09-14 05:29:14 +08:00
}
return 0;
cleanup:
failed_cpu = cpu;
for_each_cpu(cpu, msk) {
if (cpu == failed_cpu)
break;
irq_matrix_remove_managed(m, cpumask_of(cpu));
}
return -ENOSPC;
}
/**
* irq_matrix_remove_managed - Remove managed interrupts in a CPU map
* @m: Matrix pointer
* @msk: On which CPUs the bits should be removed
*
* Can be called for offline CPUs
*
* This removes not allocated managed interrupts from the map. It does
* not matter which one because the managed interrupts free their
* allocation when they shut down. If not, the accounting is screwed,
* but all what can be done at this point is warn about it.
*/
void irq_matrix_remove_managed(struct irq_matrix *m, const struct cpumask *msk)
{
unsigned int cpu;
for_each_cpu(cpu, msk) {
struct cpumap *cm = per_cpu_ptr(m->maps, cpu);
unsigned int bit, end = m->alloc_end;
if (WARN_ON_ONCE(!cm->managed))
continue;
/* Get managed bit which are not allocated */
bitmap_andnot(m->scratch_map, cm->managed_map, cm->alloc_map, end);
bit = find_first_bit(m->scratch_map, end);
if (WARN_ON_ONCE(bit >= end))
continue;
clear_bit(bit, cm->managed_map);
cm->managed--;
if (cm->online) {
cm->available++;
m->global_available++;
}
trace_irq_matrix_remove_managed(bit, cpu, m, cm);
genirq: Implement bitmap matrix allocator Implement the infrastructure for a simple bitmap based allocator, which will replace the x86 vector allocator. It's in the core code as other architectures might be able to reuse/extend it. For now it only implements allocations for single CPUs, but it's simple to add multi CPU allocation support if required. The concept is rather simple: Global information: system_vector bitmap global accounting PerCPU information: allocation bitmap managed allocation bitmap local accounting The system vector bitmap is used to exclude vectors system wide from the allocation space. The allocation bitmap is used to keep track of per cpu used vectors. The managed allocation bitmap is used to reserve vectors for managed interrupts. When a regular (non managed) interrupt allocation happens then the following rule applies: tmpmap = system_map | alloc_map | managed_map find_zero_bit(tmpmap) Oring the bitmaps together gives the real available space. The same rule applies for reserving a managed interrupt vector. But contrary to the regular interrupts the reservation only marks the bit in the managed map and therefor excludes it from the regular allocations. The managed map is only cleaned out when the a managed interrupt is completely released and it stays alive accross CPU offline/online operations. For managed interrupt allocations the rule is: tmpmap = managed_map & ~alloc_map find_first_bit(tmpmap) This returns the first bit which is in the managed map, but not yet allocated in the allocation map. The allocation marks it in the allocation map and hands it back to the caller for use. The rest of the code are helper functions to handle the various requirements and the accounting which are necessary to replace the x86 vector allocation code. The result is a single patch as the evolution of this infrastructure cannot be represented in bits and pieces. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Juergen Gross <jgross@suse.com> Tested-by: Yu Chen <yu.c.chen@intel.com> Acked-by: Juergen Gross <jgross@suse.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Marc Zyngier <marc.zyngier@arm.com> Cc: Alok Kataria <akataria@vmware.com> Cc: Joerg Roedel <joro@8bytes.org> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Rui Zhang <rui.zhang@intel.com> Cc: "K. Y. Srinivasan" <kys@microsoft.com> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Chris Metcalf <cmetcalf@mellanox.com> Cc: Len Brown <lenb@kernel.org> Link: https://lkml.kernel.org/r/20170913213153.185437174@linutronix.de
2017-09-14 05:29:14 +08:00
}
}
/**
* irq_matrix_alloc_managed - Allocate a managed interrupt in a CPU map
* @m: Matrix pointer
* @cpu: On which CPU the interrupt should be allocated
*/
int irq_matrix_alloc_managed(struct irq_matrix *m, const struct cpumask *msk,
unsigned int *mapped_cpu)
genirq: Implement bitmap matrix allocator Implement the infrastructure for a simple bitmap based allocator, which will replace the x86 vector allocator. It's in the core code as other architectures might be able to reuse/extend it. For now it only implements allocations for single CPUs, but it's simple to add multi CPU allocation support if required. The concept is rather simple: Global information: system_vector bitmap global accounting PerCPU information: allocation bitmap managed allocation bitmap local accounting The system vector bitmap is used to exclude vectors system wide from the allocation space. The allocation bitmap is used to keep track of per cpu used vectors. The managed allocation bitmap is used to reserve vectors for managed interrupts. When a regular (non managed) interrupt allocation happens then the following rule applies: tmpmap = system_map | alloc_map | managed_map find_zero_bit(tmpmap) Oring the bitmaps together gives the real available space. The same rule applies for reserving a managed interrupt vector. But contrary to the regular interrupts the reservation only marks the bit in the managed map and therefor excludes it from the regular allocations. The managed map is only cleaned out when the a managed interrupt is completely released and it stays alive accross CPU offline/online operations. For managed interrupt allocations the rule is: tmpmap = managed_map & ~alloc_map find_first_bit(tmpmap) This returns the first bit which is in the managed map, but not yet allocated in the allocation map. The allocation marks it in the allocation map and hands it back to the caller for use. The rest of the code are helper functions to handle the various requirements and the accounting which are necessary to replace the x86 vector allocation code. The result is a single patch as the evolution of this infrastructure cannot be represented in bits and pieces. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Juergen Gross <jgross@suse.com> Tested-by: Yu Chen <yu.c.chen@intel.com> Acked-by: Juergen Gross <jgross@suse.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Marc Zyngier <marc.zyngier@arm.com> Cc: Alok Kataria <akataria@vmware.com> Cc: Joerg Roedel <joro@8bytes.org> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Rui Zhang <rui.zhang@intel.com> Cc: "K. Y. Srinivasan" <kys@microsoft.com> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Chris Metcalf <cmetcalf@mellanox.com> Cc: Len Brown <lenb@kernel.org> Link: https://lkml.kernel.org/r/20170913213153.185437174@linutronix.de
2017-09-14 05:29:14 +08:00
{
unsigned int bit, cpu, end = m->alloc_end;
struct cpumap *cm;
if (cpumask_empty(msk))
return -EINVAL;
genirq/matrix: Improve target CPU selection for managed interrupts. On large systems with multiple devices of the same class (e.g. NVMe disks, using managed interrupts), the kernel can affinitize these interrupts to a small subset of CPUs instead of spreading them out evenly. irq_matrix_alloc_managed() tries to select the CPU in the supplied cpumask of possible target CPUs which has the lowest number of interrupt vectors allocated. This is done by searching the CPU with the highest number of available vectors. While this is correct for non-managed CPUs it can select the wrong CPU for managed interrupts. Under certain constellations this results in affinitizing the managed interrupts of several devices to a single CPU in a set. The book keeping of available vectors works the following way: 1) Non-managed interrupts: available is decremented when the interrupt is actually requested by the device driver and a vector is assigned. It's incremented when the interrupt and the vector are freed. 2) Managed interrupts: Managed interrupts guarantee vector reservation when the MSI/MSI-X functionality of a device is enabled, which is achieved by reserving vectors in the bitmaps of the possible target CPUs. This reservation decrements the available count on each possible target CPU. When the interrupt is requested by the device driver then a vector is allocated from the reserved region. The operation is reversed when the interrupt is freed by the device driver. Neither of these operations affect the available count. The reservation persist up to the point where the MSI/MSI-X functionality is disabled and only this operation increments the available count again. For non-managed interrupts the available count is the correct selection criterion because the guaranteed reservations need to be taken into account. Using the allocated counter could lead to a failing allocation in the following situation (total vector space of 10 assumed): CPU0 CPU1 available: 2 0 allocated: 5 3 <--- CPU1 is selected, but available space = 0 managed reserved: 3 7 while available yields the correct result. For managed interrupts the available count is not the appropriate selection criterion because as explained above the available count is not affected by the actual vector allocation. The following example illustrates that. Total vector space of 10 assumed. The starting point is: CPU0 CPU1 available: 5 4 allocated: 2 3 managed reserved: 3 3 Allocating vectors for three non-managed interrupts will result in affinitizing the first two to CPU0 and the third one to CPU1 because the available count is adjusted with each allocation: CPU0 CPU1 available: 5 4 <- Select CPU0 for 1st allocation --> allocated: 3 3 available: 4 4 <- Select CPU0 for 2nd allocation --> allocated: 4 3 available: 3 4 <- Select CPU1 for 3rd allocation --> allocated: 4 4 But the allocation of three managed interrupts starting from the same point will affinitize all of them to CPU0 because the available count is not affected by the allocation (see above). So the end result is: CPU0 CPU1 available: 5 4 allocated: 5 3 Introduce a "managed_allocated" field in struct cpumap to track the vector allocation for managed interrupts separately. Use this information to select the target CPU when a vector is allocated for a managed interrupt, which results in more evenly distributed vector assignments. The above example results in the following allocations: CPU0 CPU1 managed_allocated: 0 0 <- Select CPU0 for 1st allocation --> allocated: 3 3 managed_allocated: 1 0 <- Select CPU1 for 2nd allocation --> allocated: 3 4 managed_allocated: 1 1 <- Select CPU0 for 3rd allocation --> allocated: 4 4 The allocation of non-managed interrupts is not affected by this change and is still evaluating the available count. The overall distribution of interrupt vectors for both types of interrupts might still not be perfectly even depending on the number of non-managed and managed interrupts in a system, but due to the reservation guarantee for managed interrupts this cannot be avoided. Expose the new field in debugfs as well. [ tglx: Clarified the background of the problem in the changelog and described it independent of NVME ] Signed-off-by: Long Li <longli@microsoft.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Michael Kelley <mikelley@microsoft.com> Link: https://lkml.kernel.org/r/20181106040000.27316-1-longli@linuxonhyperv.com
2018-11-06 12:00:00 +08:00
cpu = matrix_find_best_cpu_managed(m, msk);
if (cpu == UINT_MAX)
return -ENOSPC;
genirq: Implement bitmap matrix allocator Implement the infrastructure for a simple bitmap based allocator, which will replace the x86 vector allocator. It's in the core code as other architectures might be able to reuse/extend it. For now it only implements allocations for single CPUs, but it's simple to add multi CPU allocation support if required. The concept is rather simple: Global information: system_vector bitmap global accounting PerCPU information: allocation bitmap managed allocation bitmap local accounting The system vector bitmap is used to exclude vectors system wide from the allocation space. The allocation bitmap is used to keep track of per cpu used vectors. The managed allocation bitmap is used to reserve vectors for managed interrupts. When a regular (non managed) interrupt allocation happens then the following rule applies: tmpmap = system_map | alloc_map | managed_map find_zero_bit(tmpmap) Oring the bitmaps together gives the real available space. The same rule applies for reserving a managed interrupt vector. But contrary to the regular interrupts the reservation only marks the bit in the managed map and therefor excludes it from the regular allocations. The managed map is only cleaned out when the a managed interrupt is completely released and it stays alive accross CPU offline/online operations. For managed interrupt allocations the rule is: tmpmap = managed_map & ~alloc_map find_first_bit(tmpmap) This returns the first bit which is in the managed map, but not yet allocated in the allocation map. The allocation marks it in the allocation map and hands it back to the caller for use. The rest of the code are helper functions to handle the various requirements and the accounting which are necessary to replace the x86 vector allocation code. The result is a single patch as the evolution of this infrastructure cannot be represented in bits and pieces. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Juergen Gross <jgross@suse.com> Tested-by: Yu Chen <yu.c.chen@intel.com> Acked-by: Juergen Gross <jgross@suse.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Marc Zyngier <marc.zyngier@arm.com> Cc: Alok Kataria <akataria@vmware.com> Cc: Joerg Roedel <joro@8bytes.org> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Rui Zhang <rui.zhang@intel.com> Cc: "K. Y. Srinivasan" <kys@microsoft.com> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Chris Metcalf <cmetcalf@mellanox.com> Cc: Len Brown <lenb@kernel.org> Link: https://lkml.kernel.org/r/20170913213153.185437174@linutronix.de
2017-09-14 05:29:14 +08:00
cm = per_cpu_ptr(m->maps, cpu);
end = m->alloc_end;
genirq: Implement bitmap matrix allocator Implement the infrastructure for a simple bitmap based allocator, which will replace the x86 vector allocator. It's in the core code as other architectures might be able to reuse/extend it. For now it only implements allocations for single CPUs, but it's simple to add multi CPU allocation support if required. The concept is rather simple: Global information: system_vector bitmap global accounting PerCPU information: allocation bitmap managed allocation bitmap local accounting The system vector bitmap is used to exclude vectors system wide from the allocation space. The allocation bitmap is used to keep track of per cpu used vectors. The managed allocation bitmap is used to reserve vectors for managed interrupts. When a regular (non managed) interrupt allocation happens then the following rule applies: tmpmap = system_map | alloc_map | managed_map find_zero_bit(tmpmap) Oring the bitmaps together gives the real available space. The same rule applies for reserving a managed interrupt vector. But contrary to the regular interrupts the reservation only marks the bit in the managed map and therefor excludes it from the regular allocations. The managed map is only cleaned out when the a managed interrupt is completely released and it stays alive accross CPU offline/online operations. For managed interrupt allocations the rule is: tmpmap = managed_map & ~alloc_map find_first_bit(tmpmap) This returns the first bit which is in the managed map, but not yet allocated in the allocation map. The allocation marks it in the allocation map and hands it back to the caller for use. The rest of the code are helper functions to handle the various requirements and the accounting which are necessary to replace the x86 vector allocation code. The result is a single patch as the evolution of this infrastructure cannot be represented in bits and pieces. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Juergen Gross <jgross@suse.com> Tested-by: Yu Chen <yu.c.chen@intel.com> Acked-by: Juergen Gross <jgross@suse.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Marc Zyngier <marc.zyngier@arm.com> Cc: Alok Kataria <akataria@vmware.com> Cc: Joerg Roedel <joro@8bytes.org> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Rui Zhang <rui.zhang@intel.com> Cc: "K. Y. Srinivasan" <kys@microsoft.com> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Chris Metcalf <cmetcalf@mellanox.com> Cc: Len Brown <lenb@kernel.org> Link: https://lkml.kernel.org/r/20170913213153.185437174@linutronix.de
2017-09-14 05:29:14 +08:00
/* Get managed bit which are not allocated */
bitmap_andnot(m->scratch_map, cm->managed_map, cm->alloc_map, end);
bit = find_first_bit(m->scratch_map, end);
if (bit >= end)
return -ENOSPC;
set_bit(bit, cm->alloc_map);
cm->allocated++;
genirq/matrix: Improve target CPU selection for managed interrupts. On large systems with multiple devices of the same class (e.g. NVMe disks, using managed interrupts), the kernel can affinitize these interrupts to a small subset of CPUs instead of spreading them out evenly. irq_matrix_alloc_managed() tries to select the CPU in the supplied cpumask of possible target CPUs which has the lowest number of interrupt vectors allocated. This is done by searching the CPU with the highest number of available vectors. While this is correct for non-managed CPUs it can select the wrong CPU for managed interrupts. Under certain constellations this results in affinitizing the managed interrupts of several devices to a single CPU in a set. The book keeping of available vectors works the following way: 1) Non-managed interrupts: available is decremented when the interrupt is actually requested by the device driver and a vector is assigned. It's incremented when the interrupt and the vector are freed. 2) Managed interrupts: Managed interrupts guarantee vector reservation when the MSI/MSI-X functionality of a device is enabled, which is achieved by reserving vectors in the bitmaps of the possible target CPUs. This reservation decrements the available count on each possible target CPU. When the interrupt is requested by the device driver then a vector is allocated from the reserved region. The operation is reversed when the interrupt is freed by the device driver. Neither of these operations affect the available count. The reservation persist up to the point where the MSI/MSI-X functionality is disabled and only this operation increments the available count again. For non-managed interrupts the available count is the correct selection criterion because the guaranteed reservations need to be taken into account. Using the allocated counter could lead to a failing allocation in the following situation (total vector space of 10 assumed): CPU0 CPU1 available: 2 0 allocated: 5 3 <--- CPU1 is selected, but available space = 0 managed reserved: 3 7 while available yields the correct result. For managed interrupts the available count is not the appropriate selection criterion because as explained above the available count is not affected by the actual vector allocation. The following example illustrates that. Total vector space of 10 assumed. The starting point is: CPU0 CPU1 available: 5 4 allocated: 2 3 managed reserved: 3 3 Allocating vectors for three non-managed interrupts will result in affinitizing the first two to CPU0 and the third one to CPU1 because the available count is adjusted with each allocation: CPU0 CPU1 available: 5 4 <- Select CPU0 for 1st allocation --> allocated: 3 3 available: 4 4 <- Select CPU0 for 2nd allocation --> allocated: 4 3 available: 3 4 <- Select CPU1 for 3rd allocation --> allocated: 4 4 But the allocation of three managed interrupts starting from the same point will affinitize all of them to CPU0 because the available count is not affected by the allocation (see above). So the end result is: CPU0 CPU1 available: 5 4 allocated: 5 3 Introduce a "managed_allocated" field in struct cpumap to track the vector allocation for managed interrupts separately. Use this information to select the target CPU when a vector is allocated for a managed interrupt, which results in more evenly distributed vector assignments. The above example results in the following allocations: CPU0 CPU1 managed_allocated: 0 0 <- Select CPU0 for 1st allocation --> allocated: 3 3 managed_allocated: 1 0 <- Select CPU1 for 2nd allocation --> allocated: 3 4 managed_allocated: 1 1 <- Select CPU0 for 3rd allocation --> allocated: 4 4 The allocation of non-managed interrupts is not affected by this change and is still evaluating the available count. The overall distribution of interrupt vectors for both types of interrupts might still not be perfectly even depending on the number of non-managed and managed interrupts in a system, but due to the reservation guarantee for managed interrupts this cannot be avoided. Expose the new field in debugfs as well. [ tglx: Clarified the background of the problem in the changelog and described it independent of NVME ] Signed-off-by: Long Li <longli@microsoft.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Michael Kelley <mikelley@microsoft.com> Link: https://lkml.kernel.org/r/20181106040000.27316-1-longli@linuxonhyperv.com
2018-11-06 12:00:00 +08:00
cm->managed_allocated++;
genirq: Implement bitmap matrix allocator Implement the infrastructure for a simple bitmap based allocator, which will replace the x86 vector allocator. It's in the core code as other architectures might be able to reuse/extend it. For now it only implements allocations for single CPUs, but it's simple to add multi CPU allocation support if required. The concept is rather simple: Global information: system_vector bitmap global accounting PerCPU information: allocation bitmap managed allocation bitmap local accounting The system vector bitmap is used to exclude vectors system wide from the allocation space. The allocation bitmap is used to keep track of per cpu used vectors. The managed allocation bitmap is used to reserve vectors for managed interrupts. When a regular (non managed) interrupt allocation happens then the following rule applies: tmpmap = system_map | alloc_map | managed_map find_zero_bit(tmpmap) Oring the bitmaps together gives the real available space. The same rule applies for reserving a managed interrupt vector. But contrary to the regular interrupts the reservation only marks the bit in the managed map and therefor excludes it from the regular allocations. The managed map is only cleaned out when the a managed interrupt is completely released and it stays alive accross CPU offline/online operations. For managed interrupt allocations the rule is: tmpmap = managed_map & ~alloc_map find_first_bit(tmpmap) This returns the first bit which is in the managed map, but not yet allocated in the allocation map. The allocation marks it in the allocation map and hands it back to the caller for use. The rest of the code are helper functions to handle the various requirements and the accounting which are necessary to replace the x86 vector allocation code. The result is a single patch as the evolution of this infrastructure cannot be represented in bits and pieces. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Juergen Gross <jgross@suse.com> Tested-by: Yu Chen <yu.c.chen@intel.com> Acked-by: Juergen Gross <jgross@suse.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Marc Zyngier <marc.zyngier@arm.com> Cc: Alok Kataria <akataria@vmware.com> Cc: Joerg Roedel <joro@8bytes.org> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Rui Zhang <rui.zhang@intel.com> Cc: "K. Y. Srinivasan" <kys@microsoft.com> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Chris Metcalf <cmetcalf@mellanox.com> Cc: Len Brown <lenb@kernel.org> Link: https://lkml.kernel.org/r/20170913213153.185437174@linutronix.de
2017-09-14 05:29:14 +08:00
m->total_allocated++;
*mapped_cpu = cpu;
trace_irq_matrix_alloc_managed(bit, cpu, m, cm);
genirq: Implement bitmap matrix allocator Implement the infrastructure for a simple bitmap based allocator, which will replace the x86 vector allocator. It's in the core code as other architectures might be able to reuse/extend it. For now it only implements allocations for single CPUs, but it's simple to add multi CPU allocation support if required. The concept is rather simple: Global information: system_vector bitmap global accounting PerCPU information: allocation bitmap managed allocation bitmap local accounting The system vector bitmap is used to exclude vectors system wide from the allocation space. The allocation bitmap is used to keep track of per cpu used vectors. The managed allocation bitmap is used to reserve vectors for managed interrupts. When a regular (non managed) interrupt allocation happens then the following rule applies: tmpmap = system_map | alloc_map | managed_map find_zero_bit(tmpmap) Oring the bitmaps together gives the real available space. The same rule applies for reserving a managed interrupt vector. But contrary to the regular interrupts the reservation only marks the bit in the managed map and therefor excludes it from the regular allocations. The managed map is only cleaned out when the a managed interrupt is completely released and it stays alive accross CPU offline/online operations. For managed interrupt allocations the rule is: tmpmap = managed_map & ~alloc_map find_first_bit(tmpmap) This returns the first bit which is in the managed map, but not yet allocated in the allocation map. The allocation marks it in the allocation map and hands it back to the caller for use. The rest of the code are helper functions to handle the various requirements and the accounting which are necessary to replace the x86 vector allocation code. The result is a single patch as the evolution of this infrastructure cannot be represented in bits and pieces. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Juergen Gross <jgross@suse.com> Tested-by: Yu Chen <yu.c.chen@intel.com> Acked-by: Juergen Gross <jgross@suse.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Marc Zyngier <marc.zyngier@arm.com> Cc: Alok Kataria <akataria@vmware.com> Cc: Joerg Roedel <joro@8bytes.org> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Rui Zhang <rui.zhang@intel.com> Cc: "K. Y. Srinivasan" <kys@microsoft.com> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Chris Metcalf <cmetcalf@mellanox.com> Cc: Len Brown <lenb@kernel.org> Link: https://lkml.kernel.org/r/20170913213153.185437174@linutronix.de
2017-09-14 05:29:14 +08:00
return bit;
}
/**
* irq_matrix_assign - Assign a preallocated interrupt in the local CPU map
* @m: Matrix pointer
* @bit: Which bit to mark
*
* This should only be used to mark preallocated vectors
*/
void irq_matrix_assign(struct irq_matrix *m, unsigned int bit)
{
struct cpumap *cm = this_cpu_ptr(m->maps);
if (WARN_ON_ONCE(bit < m->alloc_start || bit >= m->alloc_end))
return;
if (WARN_ON_ONCE(test_and_set_bit(bit, cm->alloc_map)))
return;
cm->allocated++;
m->total_allocated++;
cm->available--;
m->global_available--;
trace_irq_matrix_assign(bit, smp_processor_id(), m, cm);
genirq: Implement bitmap matrix allocator Implement the infrastructure for a simple bitmap based allocator, which will replace the x86 vector allocator. It's in the core code as other architectures might be able to reuse/extend it. For now it only implements allocations for single CPUs, but it's simple to add multi CPU allocation support if required. The concept is rather simple: Global information: system_vector bitmap global accounting PerCPU information: allocation bitmap managed allocation bitmap local accounting The system vector bitmap is used to exclude vectors system wide from the allocation space. The allocation bitmap is used to keep track of per cpu used vectors. The managed allocation bitmap is used to reserve vectors for managed interrupts. When a regular (non managed) interrupt allocation happens then the following rule applies: tmpmap = system_map | alloc_map | managed_map find_zero_bit(tmpmap) Oring the bitmaps together gives the real available space. The same rule applies for reserving a managed interrupt vector. But contrary to the regular interrupts the reservation only marks the bit in the managed map and therefor excludes it from the regular allocations. The managed map is only cleaned out when the a managed interrupt is completely released and it stays alive accross CPU offline/online operations. For managed interrupt allocations the rule is: tmpmap = managed_map & ~alloc_map find_first_bit(tmpmap) This returns the first bit which is in the managed map, but not yet allocated in the allocation map. The allocation marks it in the allocation map and hands it back to the caller for use. The rest of the code are helper functions to handle the various requirements and the accounting which are necessary to replace the x86 vector allocation code. The result is a single patch as the evolution of this infrastructure cannot be represented in bits and pieces. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Juergen Gross <jgross@suse.com> Tested-by: Yu Chen <yu.c.chen@intel.com> Acked-by: Juergen Gross <jgross@suse.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Marc Zyngier <marc.zyngier@arm.com> Cc: Alok Kataria <akataria@vmware.com> Cc: Joerg Roedel <joro@8bytes.org> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Rui Zhang <rui.zhang@intel.com> Cc: "K. Y. Srinivasan" <kys@microsoft.com> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Chris Metcalf <cmetcalf@mellanox.com> Cc: Len Brown <lenb@kernel.org> Link: https://lkml.kernel.org/r/20170913213153.185437174@linutronix.de
2017-09-14 05:29:14 +08:00
}
/**
* irq_matrix_reserve - Reserve interrupts
* @m: Matrix pointer
*
* This is merily a book keeping call. It increments the number of globally
* reserved interrupt bits w/o actually allocating them. This allows to
* setup interrupt descriptors w/o assigning low level resources to it.
* The actual allocation happens when the interrupt gets activated.
*/
void irq_matrix_reserve(struct irq_matrix *m)
{
if (m->global_reserved <= m->global_available &&
m->global_reserved + 1 > m->global_available)
pr_warn("Interrupt reservation exceeds available resources\n");
m->global_reserved++;
trace_irq_matrix_reserve(m);
genirq: Implement bitmap matrix allocator Implement the infrastructure for a simple bitmap based allocator, which will replace the x86 vector allocator. It's in the core code as other architectures might be able to reuse/extend it. For now it only implements allocations for single CPUs, but it's simple to add multi CPU allocation support if required. The concept is rather simple: Global information: system_vector bitmap global accounting PerCPU information: allocation bitmap managed allocation bitmap local accounting The system vector bitmap is used to exclude vectors system wide from the allocation space. The allocation bitmap is used to keep track of per cpu used vectors. The managed allocation bitmap is used to reserve vectors for managed interrupts. When a regular (non managed) interrupt allocation happens then the following rule applies: tmpmap = system_map | alloc_map | managed_map find_zero_bit(tmpmap) Oring the bitmaps together gives the real available space. The same rule applies for reserving a managed interrupt vector. But contrary to the regular interrupts the reservation only marks the bit in the managed map and therefor excludes it from the regular allocations. The managed map is only cleaned out when the a managed interrupt is completely released and it stays alive accross CPU offline/online operations. For managed interrupt allocations the rule is: tmpmap = managed_map & ~alloc_map find_first_bit(tmpmap) This returns the first bit which is in the managed map, but not yet allocated in the allocation map. The allocation marks it in the allocation map and hands it back to the caller for use. The rest of the code are helper functions to handle the various requirements and the accounting which are necessary to replace the x86 vector allocation code. The result is a single patch as the evolution of this infrastructure cannot be represented in bits and pieces. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Juergen Gross <jgross@suse.com> Tested-by: Yu Chen <yu.c.chen@intel.com> Acked-by: Juergen Gross <jgross@suse.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Marc Zyngier <marc.zyngier@arm.com> Cc: Alok Kataria <akataria@vmware.com> Cc: Joerg Roedel <joro@8bytes.org> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Rui Zhang <rui.zhang@intel.com> Cc: "K. Y. Srinivasan" <kys@microsoft.com> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Chris Metcalf <cmetcalf@mellanox.com> Cc: Len Brown <lenb@kernel.org> Link: https://lkml.kernel.org/r/20170913213153.185437174@linutronix.de
2017-09-14 05:29:14 +08:00
}
/**
* irq_matrix_remove_reserved - Remove interrupt reservation
* @m: Matrix pointer
*
* This is merily a book keeping call. It decrements the number of globally
* reserved interrupt bits. This is used to undo irq_matrix_reserve() when the
* interrupt was never in use and a real vector allocated, which undid the
* reservation.
*/
void irq_matrix_remove_reserved(struct irq_matrix *m)
{
m->global_reserved--;
trace_irq_matrix_remove_reserved(m);
genirq: Implement bitmap matrix allocator Implement the infrastructure for a simple bitmap based allocator, which will replace the x86 vector allocator. It's in the core code as other architectures might be able to reuse/extend it. For now it only implements allocations for single CPUs, but it's simple to add multi CPU allocation support if required. The concept is rather simple: Global information: system_vector bitmap global accounting PerCPU information: allocation bitmap managed allocation bitmap local accounting The system vector bitmap is used to exclude vectors system wide from the allocation space. The allocation bitmap is used to keep track of per cpu used vectors. The managed allocation bitmap is used to reserve vectors for managed interrupts. When a regular (non managed) interrupt allocation happens then the following rule applies: tmpmap = system_map | alloc_map | managed_map find_zero_bit(tmpmap) Oring the bitmaps together gives the real available space. The same rule applies for reserving a managed interrupt vector. But contrary to the regular interrupts the reservation only marks the bit in the managed map and therefor excludes it from the regular allocations. The managed map is only cleaned out when the a managed interrupt is completely released and it stays alive accross CPU offline/online operations. For managed interrupt allocations the rule is: tmpmap = managed_map & ~alloc_map find_first_bit(tmpmap) This returns the first bit which is in the managed map, but not yet allocated in the allocation map. The allocation marks it in the allocation map and hands it back to the caller for use. The rest of the code are helper functions to handle the various requirements and the accounting which are necessary to replace the x86 vector allocation code. The result is a single patch as the evolution of this infrastructure cannot be represented in bits and pieces. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Juergen Gross <jgross@suse.com> Tested-by: Yu Chen <yu.c.chen@intel.com> Acked-by: Juergen Gross <jgross@suse.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Marc Zyngier <marc.zyngier@arm.com> Cc: Alok Kataria <akataria@vmware.com> Cc: Joerg Roedel <joro@8bytes.org> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Rui Zhang <rui.zhang@intel.com> Cc: "K. Y. Srinivasan" <kys@microsoft.com> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Chris Metcalf <cmetcalf@mellanox.com> Cc: Len Brown <lenb@kernel.org> Link: https://lkml.kernel.org/r/20170913213153.185437174@linutronix.de
2017-09-14 05:29:14 +08:00
}
/**
* irq_matrix_alloc - Allocate a regular interrupt in a CPU map
* @m: Matrix pointer
* @msk: Which CPUs to search in
* @reserved: Allocate previously reserved interrupts
* @mapped_cpu: Pointer to store the CPU for which the irq was allocated
*/
int irq_matrix_alloc(struct irq_matrix *m, const struct cpumask *msk,
bool reserved, unsigned int *mapped_cpu)
{
unsigned int cpu, bit;
struct cpumap *cm;
genirq: Implement bitmap matrix allocator Implement the infrastructure for a simple bitmap based allocator, which will replace the x86 vector allocator. It's in the core code as other architectures might be able to reuse/extend it. For now it only implements allocations for single CPUs, but it's simple to add multi CPU allocation support if required. The concept is rather simple: Global information: system_vector bitmap global accounting PerCPU information: allocation bitmap managed allocation bitmap local accounting The system vector bitmap is used to exclude vectors system wide from the allocation space. The allocation bitmap is used to keep track of per cpu used vectors. The managed allocation bitmap is used to reserve vectors for managed interrupts. When a regular (non managed) interrupt allocation happens then the following rule applies: tmpmap = system_map | alloc_map | managed_map find_zero_bit(tmpmap) Oring the bitmaps together gives the real available space. The same rule applies for reserving a managed interrupt vector. But contrary to the regular interrupts the reservation only marks the bit in the managed map and therefor excludes it from the regular allocations. The managed map is only cleaned out when the a managed interrupt is completely released and it stays alive accross CPU offline/online operations. For managed interrupt allocations the rule is: tmpmap = managed_map & ~alloc_map find_first_bit(tmpmap) This returns the first bit which is in the managed map, but not yet allocated in the allocation map. The allocation marks it in the allocation map and hands it back to the caller for use. The rest of the code are helper functions to handle the various requirements and the accounting which are necessary to replace the x86 vector allocation code. The result is a single patch as the evolution of this infrastructure cannot be represented in bits and pieces. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Juergen Gross <jgross@suse.com> Tested-by: Yu Chen <yu.c.chen@intel.com> Acked-by: Juergen Gross <jgross@suse.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Marc Zyngier <marc.zyngier@arm.com> Cc: Alok Kataria <akataria@vmware.com> Cc: Joerg Roedel <joro@8bytes.org> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Rui Zhang <rui.zhang@intel.com> Cc: "K. Y. Srinivasan" <kys@microsoft.com> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Chris Metcalf <cmetcalf@mellanox.com> Cc: Len Brown <lenb@kernel.org> Link: https://lkml.kernel.org/r/20170913213153.185437174@linutronix.de
2017-09-14 05:29:14 +08:00
/*
* Not required in theory, but matrix_find_best_cpu() uses
* for_each_cpu() which ignores the cpumask on UP .
*/
if (cpumask_empty(msk))
return -EINVAL;
cpu = matrix_find_best_cpu(m, msk);
if (cpu == UINT_MAX)
return -ENOSPC;
genirq: Implement bitmap matrix allocator Implement the infrastructure for a simple bitmap based allocator, which will replace the x86 vector allocator. It's in the core code as other architectures might be able to reuse/extend it. For now it only implements allocations for single CPUs, but it's simple to add multi CPU allocation support if required. The concept is rather simple: Global information: system_vector bitmap global accounting PerCPU information: allocation bitmap managed allocation bitmap local accounting The system vector bitmap is used to exclude vectors system wide from the allocation space. The allocation bitmap is used to keep track of per cpu used vectors. The managed allocation bitmap is used to reserve vectors for managed interrupts. When a regular (non managed) interrupt allocation happens then the following rule applies: tmpmap = system_map | alloc_map | managed_map find_zero_bit(tmpmap) Oring the bitmaps together gives the real available space. The same rule applies for reserving a managed interrupt vector. But contrary to the regular interrupts the reservation only marks the bit in the managed map and therefor excludes it from the regular allocations. The managed map is only cleaned out when the a managed interrupt is completely released and it stays alive accross CPU offline/online operations. For managed interrupt allocations the rule is: tmpmap = managed_map & ~alloc_map find_first_bit(tmpmap) This returns the first bit which is in the managed map, but not yet allocated in the allocation map. The allocation marks it in the allocation map and hands it back to the caller for use. The rest of the code are helper functions to handle the various requirements and the accounting which are necessary to replace the x86 vector allocation code. The result is a single patch as the evolution of this infrastructure cannot be represented in bits and pieces. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Juergen Gross <jgross@suse.com> Tested-by: Yu Chen <yu.c.chen@intel.com> Acked-by: Juergen Gross <jgross@suse.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Marc Zyngier <marc.zyngier@arm.com> Cc: Alok Kataria <akataria@vmware.com> Cc: Joerg Roedel <joro@8bytes.org> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Rui Zhang <rui.zhang@intel.com> Cc: "K. Y. Srinivasan" <kys@microsoft.com> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Chris Metcalf <cmetcalf@mellanox.com> Cc: Len Brown <lenb@kernel.org> Link: https://lkml.kernel.org/r/20170913213153.185437174@linutronix.de
2017-09-14 05:29:14 +08:00
cm = per_cpu_ptr(m->maps, cpu);
bit = matrix_alloc_area(m, cm, 1, false);
if (bit >= m->alloc_end)
return -ENOSPC;
cm->allocated++;
cm->available--;
m->total_allocated++;
m->global_available--;
if (reserved)
m->global_reserved--;
*mapped_cpu = cpu;
trace_irq_matrix_alloc(bit, cpu, m, cm);
return bit;
genirq: Implement bitmap matrix allocator Implement the infrastructure for a simple bitmap based allocator, which will replace the x86 vector allocator. It's in the core code as other architectures might be able to reuse/extend it. For now it only implements allocations for single CPUs, but it's simple to add multi CPU allocation support if required. The concept is rather simple: Global information: system_vector bitmap global accounting PerCPU information: allocation bitmap managed allocation bitmap local accounting The system vector bitmap is used to exclude vectors system wide from the allocation space. The allocation bitmap is used to keep track of per cpu used vectors. The managed allocation bitmap is used to reserve vectors for managed interrupts. When a regular (non managed) interrupt allocation happens then the following rule applies: tmpmap = system_map | alloc_map | managed_map find_zero_bit(tmpmap) Oring the bitmaps together gives the real available space. The same rule applies for reserving a managed interrupt vector. But contrary to the regular interrupts the reservation only marks the bit in the managed map and therefor excludes it from the regular allocations. The managed map is only cleaned out when the a managed interrupt is completely released and it stays alive accross CPU offline/online operations. For managed interrupt allocations the rule is: tmpmap = managed_map & ~alloc_map find_first_bit(tmpmap) This returns the first bit which is in the managed map, but not yet allocated in the allocation map. The allocation marks it in the allocation map and hands it back to the caller for use. The rest of the code are helper functions to handle the various requirements and the accounting which are necessary to replace the x86 vector allocation code. The result is a single patch as the evolution of this infrastructure cannot be represented in bits and pieces. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Juergen Gross <jgross@suse.com> Tested-by: Yu Chen <yu.c.chen@intel.com> Acked-by: Juergen Gross <jgross@suse.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Marc Zyngier <marc.zyngier@arm.com> Cc: Alok Kataria <akataria@vmware.com> Cc: Joerg Roedel <joro@8bytes.org> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Rui Zhang <rui.zhang@intel.com> Cc: "K. Y. Srinivasan" <kys@microsoft.com> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Chris Metcalf <cmetcalf@mellanox.com> Cc: Len Brown <lenb@kernel.org> Link: https://lkml.kernel.org/r/20170913213153.185437174@linutronix.de
2017-09-14 05:29:14 +08:00
}
/**
* irq_matrix_free - Free allocated interrupt in the matrix
* @m: Matrix pointer
* @cpu: Which CPU map needs be updated
* @bit: The bit to remove
* @managed: If true, the interrupt is managed and not accounted
* as available.
*/
void irq_matrix_free(struct irq_matrix *m, unsigned int cpu,
unsigned int bit, bool managed)
{
struct cpumap *cm = per_cpu_ptr(m->maps, cpu);
if (WARN_ON_ONCE(bit < m->alloc_start || bit >= m->alloc_end))
return;
clear_bit(bit, cm->alloc_map);
cm->allocated--;
genirq/matrix: Improve target CPU selection for managed interrupts. On large systems with multiple devices of the same class (e.g. NVMe disks, using managed interrupts), the kernel can affinitize these interrupts to a small subset of CPUs instead of spreading them out evenly. irq_matrix_alloc_managed() tries to select the CPU in the supplied cpumask of possible target CPUs which has the lowest number of interrupt vectors allocated. This is done by searching the CPU with the highest number of available vectors. While this is correct for non-managed CPUs it can select the wrong CPU for managed interrupts. Under certain constellations this results in affinitizing the managed interrupts of several devices to a single CPU in a set. The book keeping of available vectors works the following way: 1) Non-managed interrupts: available is decremented when the interrupt is actually requested by the device driver and a vector is assigned. It's incremented when the interrupt and the vector are freed. 2) Managed interrupts: Managed interrupts guarantee vector reservation when the MSI/MSI-X functionality of a device is enabled, which is achieved by reserving vectors in the bitmaps of the possible target CPUs. This reservation decrements the available count on each possible target CPU. When the interrupt is requested by the device driver then a vector is allocated from the reserved region. The operation is reversed when the interrupt is freed by the device driver. Neither of these operations affect the available count. The reservation persist up to the point where the MSI/MSI-X functionality is disabled and only this operation increments the available count again. For non-managed interrupts the available count is the correct selection criterion because the guaranteed reservations need to be taken into account. Using the allocated counter could lead to a failing allocation in the following situation (total vector space of 10 assumed): CPU0 CPU1 available: 2 0 allocated: 5 3 <--- CPU1 is selected, but available space = 0 managed reserved: 3 7 while available yields the correct result. For managed interrupts the available count is not the appropriate selection criterion because as explained above the available count is not affected by the actual vector allocation. The following example illustrates that. Total vector space of 10 assumed. The starting point is: CPU0 CPU1 available: 5 4 allocated: 2 3 managed reserved: 3 3 Allocating vectors for three non-managed interrupts will result in affinitizing the first two to CPU0 and the third one to CPU1 because the available count is adjusted with each allocation: CPU0 CPU1 available: 5 4 <- Select CPU0 for 1st allocation --> allocated: 3 3 available: 4 4 <- Select CPU0 for 2nd allocation --> allocated: 4 3 available: 3 4 <- Select CPU1 for 3rd allocation --> allocated: 4 4 But the allocation of three managed interrupts starting from the same point will affinitize all of them to CPU0 because the available count is not affected by the allocation (see above). So the end result is: CPU0 CPU1 available: 5 4 allocated: 5 3 Introduce a "managed_allocated" field in struct cpumap to track the vector allocation for managed interrupts separately. Use this information to select the target CPU when a vector is allocated for a managed interrupt, which results in more evenly distributed vector assignments. The above example results in the following allocations: CPU0 CPU1 managed_allocated: 0 0 <- Select CPU0 for 1st allocation --> allocated: 3 3 managed_allocated: 1 0 <- Select CPU1 for 2nd allocation --> allocated: 3 4 managed_allocated: 1 1 <- Select CPU0 for 3rd allocation --> allocated: 4 4 The allocation of non-managed interrupts is not affected by this change and is still evaluating the available count. The overall distribution of interrupt vectors for both types of interrupts might still not be perfectly even depending on the number of non-managed and managed interrupts in a system, but due to the reservation guarantee for managed interrupts this cannot be avoided. Expose the new field in debugfs as well. [ tglx: Clarified the background of the problem in the changelog and described it independent of NVME ] Signed-off-by: Long Li <longli@microsoft.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Michael Kelley <mikelley@microsoft.com> Link: https://lkml.kernel.org/r/20181106040000.27316-1-longli@linuxonhyperv.com
2018-11-06 12:00:00 +08:00
if(managed)
cm->managed_allocated--;
if (cm->online)
genirq: Implement bitmap matrix allocator Implement the infrastructure for a simple bitmap based allocator, which will replace the x86 vector allocator. It's in the core code as other architectures might be able to reuse/extend it. For now it only implements allocations for single CPUs, but it's simple to add multi CPU allocation support if required. The concept is rather simple: Global information: system_vector bitmap global accounting PerCPU information: allocation bitmap managed allocation bitmap local accounting The system vector bitmap is used to exclude vectors system wide from the allocation space. The allocation bitmap is used to keep track of per cpu used vectors. The managed allocation bitmap is used to reserve vectors for managed interrupts. When a regular (non managed) interrupt allocation happens then the following rule applies: tmpmap = system_map | alloc_map | managed_map find_zero_bit(tmpmap) Oring the bitmaps together gives the real available space. The same rule applies for reserving a managed interrupt vector. But contrary to the regular interrupts the reservation only marks the bit in the managed map and therefor excludes it from the regular allocations. The managed map is only cleaned out when the a managed interrupt is completely released and it stays alive accross CPU offline/online operations. For managed interrupt allocations the rule is: tmpmap = managed_map & ~alloc_map find_first_bit(tmpmap) This returns the first bit which is in the managed map, but not yet allocated in the allocation map. The allocation marks it in the allocation map and hands it back to the caller for use. The rest of the code are helper functions to handle the various requirements and the accounting which are necessary to replace the x86 vector allocation code. The result is a single patch as the evolution of this infrastructure cannot be represented in bits and pieces. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Juergen Gross <jgross@suse.com> Tested-by: Yu Chen <yu.c.chen@intel.com> Acked-by: Juergen Gross <jgross@suse.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Marc Zyngier <marc.zyngier@arm.com> Cc: Alok Kataria <akataria@vmware.com> Cc: Joerg Roedel <joro@8bytes.org> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Rui Zhang <rui.zhang@intel.com> Cc: "K. Y. Srinivasan" <kys@microsoft.com> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Chris Metcalf <cmetcalf@mellanox.com> Cc: Len Brown <lenb@kernel.org> Link: https://lkml.kernel.org/r/20170913213153.185437174@linutronix.de
2017-09-14 05:29:14 +08:00
m->total_allocated--;
if (!managed) {
cm->available++;
if (cm->online)
genirq: Implement bitmap matrix allocator Implement the infrastructure for a simple bitmap based allocator, which will replace the x86 vector allocator. It's in the core code as other architectures might be able to reuse/extend it. For now it only implements allocations for single CPUs, but it's simple to add multi CPU allocation support if required. The concept is rather simple: Global information: system_vector bitmap global accounting PerCPU information: allocation bitmap managed allocation bitmap local accounting The system vector bitmap is used to exclude vectors system wide from the allocation space. The allocation bitmap is used to keep track of per cpu used vectors. The managed allocation bitmap is used to reserve vectors for managed interrupts. When a regular (non managed) interrupt allocation happens then the following rule applies: tmpmap = system_map | alloc_map | managed_map find_zero_bit(tmpmap) Oring the bitmaps together gives the real available space. The same rule applies for reserving a managed interrupt vector. But contrary to the regular interrupts the reservation only marks the bit in the managed map and therefor excludes it from the regular allocations. The managed map is only cleaned out when the a managed interrupt is completely released and it stays alive accross CPU offline/online operations. For managed interrupt allocations the rule is: tmpmap = managed_map & ~alloc_map find_first_bit(tmpmap) This returns the first bit which is in the managed map, but not yet allocated in the allocation map. The allocation marks it in the allocation map and hands it back to the caller for use. The rest of the code are helper functions to handle the various requirements and the accounting which are necessary to replace the x86 vector allocation code. The result is a single patch as the evolution of this infrastructure cannot be represented in bits and pieces. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Juergen Gross <jgross@suse.com> Tested-by: Yu Chen <yu.c.chen@intel.com> Acked-by: Juergen Gross <jgross@suse.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Marc Zyngier <marc.zyngier@arm.com> Cc: Alok Kataria <akataria@vmware.com> Cc: Joerg Roedel <joro@8bytes.org> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Rui Zhang <rui.zhang@intel.com> Cc: "K. Y. Srinivasan" <kys@microsoft.com> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Chris Metcalf <cmetcalf@mellanox.com> Cc: Len Brown <lenb@kernel.org> Link: https://lkml.kernel.org/r/20170913213153.185437174@linutronix.de
2017-09-14 05:29:14 +08:00
m->global_available++;
}
trace_irq_matrix_free(bit, cpu, m, cm);
genirq: Implement bitmap matrix allocator Implement the infrastructure for a simple bitmap based allocator, which will replace the x86 vector allocator. It's in the core code as other architectures might be able to reuse/extend it. For now it only implements allocations for single CPUs, but it's simple to add multi CPU allocation support if required. The concept is rather simple: Global information: system_vector bitmap global accounting PerCPU information: allocation bitmap managed allocation bitmap local accounting The system vector bitmap is used to exclude vectors system wide from the allocation space. The allocation bitmap is used to keep track of per cpu used vectors. The managed allocation bitmap is used to reserve vectors for managed interrupts. When a regular (non managed) interrupt allocation happens then the following rule applies: tmpmap = system_map | alloc_map | managed_map find_zero_bit(tmpmap) Oring the bitmaps together gives the real available space. The same rule applies for reserving a managed interrupt vector. But contrary to the regular interrupts the reservation only marks the bit in the managed map and therefor excludes it from the regular allocations. The managed map is only cleaned out when the a managed interrupt is completely released and it stays alive accross CPU offline/online operations. For managed interrupt allocations the rule is: tmpmap = managed_map & ~alloc_map find_first_bit(tmpmap) This returns the first bit which is in the managed map, but not yet allocated in the allocation map. The allocation marks it in the allocation map and hands it back to the caller for use. The rest of the code are helper functions to handle the various requirements and the accounting which are necessary to replace the x86 vector allocation code. The result is a single patch as the evolution of this infrastructure cannot be represented in bits and pieces. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Juergen Gross <jgross@suse.com> Tested-by: Yu Chen <yu.c.chen@intel.com> Acked-by: Juergen Gross <jgross@suse.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Marc Zyngier <marc.zyngier@arm.com> Cc: Alok Kataria <akataria@vmware.com> Cc: Joerg Roedel <joro@8bytes.org> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Rui Zhang <rui.zhang@intel.com> Cc: "K. Y. Srinivasan" <kys@microsoft.com> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Chris Metcalf <cmetcalf@mellanox.com> Cc: Len Brown <lenb@kernel.org> Link: https://lkml.kernel.org/r/20170913213153.185437174@linutronix.de
2017-09-14 05:29:14 +08:00
}
/**
* irq_matrix_available - Get the number of globally available irqs
* @m: Pointer to the matrix to query
* @cpudown: If true, the local CPU is about to go down, adjust
* the number of available irqs accordingly
*/
unsigned int irq_matrix_available(struct irq_matrix *m, bool cpudown)
{
struct cpumap *cm = this_cpu_ptr(m->maps);
if (!cpudown)
return m->global_available;
return m->global_available - cm->available;
genirq: Implement bitmap matrix allocator Implement the infrastructure for a simple bitmap based allocator, which will replace the x86 vector allocator. It's in the core code as other architectures might be able to reuse/extend it. For now it only implements allocations for single CPUs, but it's simple to add multi CPU allocation support if required. The concept is rather simple: Global information: system_vector bitmap global accounting PerCPU information: allocation bitmap managed allocation bitmap local accounting The system vector bitmap is used to exclude vectors system wide from the allocation space. The allocation bitmap is used to keep track of per cpu used vectors. The managed allocation bitmap is used to reserve vectors for managed interrupts. When a regular (non managed) interrupt allocation happens then the following rule applies: tmpmap = system_map | alloc_map | managed_map find_zero_bit(tmpmap) Oring the bitmaps together gives the real available space. The same rule applies for reserving a managed interrupt vector. But contrary to the regular interrupts the reservation only marks the bit in the managed map and therefor excludes it from the regular allocations. The managed map is only cleaned out when the a managed interrupt is completely released and it stays alive accross CPU offline/online operations. For managed interrupt allocations the rule is: tmpmap = managed_map & ~alloc_map find_first_bit(tmpmap) This returns the first bit which is in the managed map, but not yet allocated in the allocation map. The allocation marks it in the allocation map and hands it back to the caller for use. The rest of the code are helper functions to handle the various requirements and the accounting which are necessary to replace the x86 vector allocation code. The result is a single patch as the evolution of this infrastructure cannot be represented in bits and pieces. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Juergen Gross <jgross@suse.com> Tested-by: Yu Chen <yu.c.chen@intel.com> Acked-by: Juergen Gross <jgross@suse.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Marc Zyngier <marc.zyngier@arm.com> Cc: Alok Kataria <akataria@vmware.com> Cc: Joerg Roedel <joro@8bytes.org> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Rui Zhang <rui.zhang@intel.com> Cc: "K. Y. Srinivasan" <kys@microsoft.com> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Chris Metcalf <cmetcalf@mellanox.com> Cc: Len Brown <lenb@kernel.org> Link: https://lkml.kernel.org/r/20170913213153.185437174@linutronix.de
2017-09-14 05:29:14 +08:00
}
/**
* irq_matrix_reserved - Get the number of globally reserved irqs
* @m: Pointer to the matrix to query
*/
unsigned int irq_matrix_reserved(struct irq_matrix *m)
{
return m->global_reserved;
}
/**
* irq_matrix_allocated - Get the number of allocated irqs on the local cpu
* @m: Pointer to the matrix to search
*
* This returns number of allocated irqs
*/
unsigned int irq_matrix_allocated(struct irq_matrix *m)
{
struct cpumap *cm = this_cpu_ptr(m->maps);
return cm->allocated;
}
#ifdef CONFIG_GENERIC_IRQ_DEBUGFS
/**
* irq_matrix_debug_show - Show detailed allocation information
* @sf: Pointer to the seq_file to print to
* @m: Pointer to the matrix allocator
* @ind: Indentation for the print format
*
* Note, this is a lockless snapshot.
*/
void irq_matrix_debug_show(struct seq_file *sf, struct irq_matrix *m, int ind)
{
unsigned int nsys = bitmap_weight(m->system_map, m->matrix_bits);
int cpu;
seq_printf(sf, "Online bitmaps: %6u\n", m->online_maps);
seq_printf(sf, "Global available: %6u\n", m->global_available);
seq_printf(sf, "Global reserved: %6u\n", m->global_reserved);
seq_printf(sf, "Total allocated: %6u\n", m->total_allocated);
seq_printf(sf, "System: %u: %*pbl\n", nsys, m->matrix_bits,
m->system_map);
genirq/matrix: Improve target CPU selection for managed interrupts. On large systems with multiple devices of the same class (e.g. NVMe disks, using managed interrupts), the kernel can affinitize these interrupts to a small subset of CPUs instead of spreading them out evenly. irq_matrix_alloc_managed() tries to select the CPU in the supplied cpumask of possible target CPUs which has the lowest number of interrupt vectors allocated. This is done by searching the CPU with the highest number of available vectors. While this is correct for non-managed CPUs it can select the wrong CPU for managed interrupts. Under certain constellations this results in affinitizing the managed interrupts of several devices to a single CPU in a set. The book keeping of available vectors works the following way: 1) Non-managed interrupts: available is decremented when the interrupt is actually requested by the device driver and a vector is assigned. It's incremented when the interrupt and the vector are freed. 2) Managed interrupts: Managed interrupts guarantee vector reservation when the MSI/MSI-X functionality of a device is enabled, which is achieved by reserving vectors in the bitmaps of the possible target CPUs. This reservation decrements the available count on each possible target CPU. When the interrupt is requested by the device driver then a vector is allocated from the reserved region. The operation is reversed when the interrupt is freed by the device driver. Neither of these operations affect the available count. The reservation persist up to the point where the MSI/MSI-X functionality is disabled and only this operation increments the available count again. For non-managed interrupts the available count is the correct selection criterion because the guaranteed reservations need to be taken into account. Using the allocated counter could lead to a failing allocation in the following situation (total vector space of 10 assumed): CPU0 CPU1 available: 2 0 allocated: 5 3 <--- CPU1 is selected, but available space = 0 managed reserved: 3 7 while available yields the correct result. For managed interrupts the available count is not the appropriate selection criterion because as explained above the available count is not affected by the actual vector allocation. The following example illustrates that. Total vector space of 10 assumed. The starting point is: CPU0 CPU1 available: 5 4 allocated: 2 3 managed reserved: 3 3 Allocating vectors for three non-managed interrupts will result in affinitizing the first two to CPU0 and the third one to CPU1 because the available count is adjusted with each allocation: CPU0 CPU1 available: 5 4 <- Select CPU0 for 1st allocation --> allocated: 3 3 available: 4 4 <- Select CPU0 for 2nd allocation --> allocated: 4 3 available: 3 4 <- Select CPU1 for 3rd allocation --> allocated: 4 4 But the allocation of three managed interrupts starting from the same point will affinitize all of them to CPU0 because the available count is not affected by the allocation (see above). So the end result is: CPU0 CPU1 available: 5 4 allocated: 5 3 Introduce a "managed_allocated" field in struct cpumap to track the vector allocation for managed interrupts separately. Use this information to select the target CPU when a vector is allocated for a managed interrupt, which results in more evenly distributed vector assignments. The above example results in the following allocations: CPU0 CPU1 managed_allocated: 0 0 <- Select CPU0 for 1st allocation --> allocated: 3 3 managed_allocated: 1 0 <- Select CPU1 for 2nd allocation --> allocated: 3 4 managed_allocated: 1 1 <- Select CPU0 for 3rd allocation --> allocated: 4 4 The allocation of non-managed interrupts is not affected by this change and is still evaluating the available count. The overall distribution of interrupt vectors for both types of interrupts might still not be perfectly even depending on the number of non-managed and managed interrupts in a system, but due to the reservation guarantee for managed interrupts this cannot be avoided. Expose the new field in debugfs as well. [ tglx: Clarified the background of the problem in the changelog and described it independent of NVME ] Signed-off-by: Long Li <longli@microsoft.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Michael Kelley <mikelley@microsoft.com> Link: https://lkml.kernel.org/r/20181106040000.27316-1-longli@linuxonhyperv.com
2018-11-06 12:00:00 +08:00
seq_printf(sf, "%*s| CPU | avl | man | mac | act | vectors\n", ind, " ");
genirq: Implement bitmap matrix allocator Implement the infrastructure for a simple bitmap based allocator, which will replace the x86 vector allocator. It's in the core code as other architectures might be able to reuse/extend it. For now it only implements allocations for single CPUs, but it's simple to add multi CPU allocation support if required. The concept is rather simple: Global information: system_vector bitmap global accounting PerCPU information: allocation bitmap managed allocation bitmap local accounting The system vector bitmap is used to exclude vectors system wide from the allocation space. The allocation bitmap is used to keep track of per cpu used vectors. The managed allocation bitmap is used to reserve vectors for managed interrupts. When a regular (non managed) interrupt allocation happens then the following rule applies: tmpmap = system_map | alloc_map | managed_map find_zero_bit(tmpmap) Oring the bitmaps together gives the real available space. The same rule applies for reserving a managed interrupt vector. But contrary to the regular interrupts the reservation only marks the bit in the managed map and therefor excludes it from the regular allocations. The managed map is only cleaned out when the a managed interrupt is completely released and it stays alive accross CPU offline/online operations. For managed interrupt allocations the rule is: tmpmap = managed_map & ~alloc_map find_first_bit(tmpmap) This returns the first bit which is in the managed map, but not yet allocated in the allocation map. The allocation marks it in the allocation map and hands it back to the caller for use. The rest of the code are helper functions to handle the various requirements and the accounting which are necessary to replace the x86 vector allocation code. The result is a single patch as the evolution of this infrastructure cannot be represented in bits and pieces. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Juergen Gross <jgross@suse.com> Tested-by: Yu Chen <yu.c.chen@intel.com> Acked-by: Juergen Gross <jgross@suse.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Marc Zyngier <marc.zyngier@arm.com> Cc: Alok Kataria <akataria@vmware.com> Cc: Joerg Roedel <joro@8bytes.org> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Rui Zhang <rui.zhang@intel.com> Cc: "K. Y. Srinivasan" <kys@microsoft.com> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Chris Metcalf <cmetcalf@mellanox.com> Cc: Len Brown <lenb@kernel.org> Link: https://lkml.kernel.org/r/20170913213153.185437174@linutronix.de
2017-09-14 05:29:14 +08:00
cpus_read_lock();
for_each_online_cpu(cpu) {
struct cpumap *cm = per_cpu_ptr(m->maps, cpu);
genirq/matrix: Improve target CPU selection for managed interrupts. On large systems with multiple devices of the same class (e.g. NVMe disks, using managed interrupts), the kernel can affinitize these interrupts to a small subset of CPUs instead of spreading them out evenly. irq_matrix_alloc_managed() tries to select the CPU in the supplied cpumask of possible target CPUs which has the lowest number of interrupt vectors allocated. This is done by searching the CPU with the highest number of available vectors. While this is correct for non-managed CPUs it can select the wrong CPU for managed interrupts. Under certain constellations this results in affinitizing the managed interrupts of several devices to a single CPU in a set. The book keeping of available vectors works the following way: 1) Non-managed interrupts: available is decremented when the interrupt is actually requested by the device driver and a vector is assigned. It's incremented when the interrupt and the vector are freed. 2) Managed interrupts: Managed interrupts guarantee vector reservation when the MSI/MSI-X functionality of a device is enabled, which is achieved by reserving vectors in the bitmaps of the possible target CPUs. This reservation decrements the available count on each possible target CPU. When the interrupt is requested by the device driver then a vector is allocated from the reserved region. The operation is reversed when the interrupt is freed by the device driver. Neither of these operations affect the available count. The reservation persist up to the point where the MSI/MSI-X functionality is disabled and only this operation increments the available count again. For non-managed interrupts the available count is the correct selection criterion because the guaranteed reservations need to be taken into account. Using the allocated counter could lead to a failing allocation in the following situation (total vector space of 10 assumed): CPU0 CPU1 available: 2 0 allocated: 5 3 <--- CPU1 is selected, but available space = 0 managed reserved: 3 7 while available yields the correct result. For managed interrupts the available count is not the appropriate selection criterion because as explained above the available count is not affected by the actual vector allocation. The following example illustrates that. Total vector space of 10 assumed. The starting point is: CPU0 CPU1 available: 5 4 allocated: 2 3 managed reserved: 3 3 Allocating vectors for three non-managed interrupts will result in affinitizing the first two to CPU0 and the third one to CPU1 because the available count is adjusted with each allocation: CPU0 CPU1 available: 5 4 <- Select CPU0 for 1st allocation --> allocated: 3 3 available: 4 4 <- Select CPU0 for 2nd allocation --> allocated: 4 3 available: 3 4 <- Select CPU1 for 3rd allocation --> allocated: 4 4 But the allocation of three managed interrupts starting from the same point will affinitize all of them to CPU0 because the available count is not affected by the allocation (see above). So the end result is: CPU0 CPU1 available: 5 4 allocated: 5 3 Introduce a "managed_allocated" field in struct cpumap to track the vector allocation for managed interrupts separately. Use this information to select the target CPU when a vector is allocated for a managed interrupt, which results in more evenly distributed vector assignments. The above example results in the following allocations: CPU0 CPU1 managed_allocated: 0 0 <- Select CPU0 for 1st allocation --> allocated: 3 3 managed_allocated: 1 0 <- Select CPU1 for 2nd allocation --> allocated: 3 4 managed_allocated: 1 1 <- Select CPU0 for 3rd allocation --> allocated: 4 4 The allocation of non-managed interrupts is not affected by this change and is still evaluating the available count. The overall distribution of interrupt vectors for both types of interrupts might still not be perfectly even depending on the number of non-managed and managed interrupts in a system, but due to the reservation guarantee for managed interrupts this cannot be avoided. Expose the new field in debugfs as well. [ tglx: Clarified the background of the problem in the changelog and described it independent of NVME ] Signed-off-by: Long Li <longli@microsoft.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Michael Kelley <mikelley@microsoft.com> Link: https://lkml.kernel.org/r/20181106040000.27316-1-longli@linuxonhyperv.com
2018-11-06 12:00:00 +08:00
seq_printf(sf, "%*s %4d %4u %4u %4u %4u %*pbl\n", ind, " ",
cpu, cm->available, cm->managed,
cm->managed_allocated, cm->allocated,
genirq: Implement bitmap matrix allocator Implement the infrastructure for a simple bitmap based allocator, which will replace the x86 vector allocator. It's in the core code as other architectures might be able to reuse/extend it. For now it only implements allocations for single CPUs, but it's simple to add multi CPU allocation support if required. The concept is rather simple: Global information: system_vector bitmap global accounting PerCPU information: allocation bitmap managed allocation bitmap local accounting The system vector bitmap is used to exclude vectors system wide from the allocation space. The allocation bitmap is used to keep track of per cpu used vectors. The managed allocation bitmap is used to reserve vectors for managed interrupts. When a regular (non managed) interrupt allocation happens then the following rule applies: tmpmap = system_map | alloc_map | managed_map find_zero_bit(tmpmap) Oring the bitmaps together gives the real available space. The same rule applies for reserving a managed interrupt vector. But contrary to the regular interrupts the reservation only marks the bit in the managed map and therefor excludes it from the regular allocations. The managed map is only cleaned out when the a managed interrupt is completely released and it stays alive accross CPU offline/online operations. For managed interrupt allocations the rule is: tmpmap = managed_map & ~alloc_map find_first_bit(tmpmap) This returns the first bit which is in the managed map, but not yet allocated in the allocation map. The allocation marks it in the allocation map and hands it back to the caller for use. The rest of the code are helper functions to handle the various requirements and the accounting which are necessary to replace the x86 vector allocation code. The result is a single patch as the evolution of this infrastructure cannot be represented in bits and pieces. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Juergen Gross <jgross@suse.com> Tested-by: Yu Chen <yu.c.chen@intel.com> Acked-by: Juergen Gross <jgross@suse.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Marc Zyngier <marc.zyngier@arm.com> Cc: Alok Kataria <akataria@vmware.com> Cc: Joerg Roedel <joro@8bytes.org> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Rui Zhang <rui.zhang@intel.com> Cc: "K. Y. Srinivasan" <kys@microsoft.com> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Chris Metcalf <cmetcalf@mellanox.com> Cc: Len Brown <lenb@kernel.org> Link: https://lkml.kernel.org/r/20170913213153.185437174@linutronix.de
2017-09-14 05:29:14 +08:00
m->matrix_bits, cm->alloc_map);
}
cpus_read_unlock();
}
#endif