The basic idea behind cross memory attach is to allow MPI programs doing
intra-node communication to do a single copy of the message rather than a
double copy of the message via shared memory.
The following patch attempts to achieve this by allowing a destination
process, given an address and size from a source process, to copy memory
directly from the source process into its own address space via a system
call. There is also a symmetrical ability to copy from the current
process's address space into a destination process's address space.
- Use of /proc/pid/mem has been considered, but there are issues with
using it:
- Does not allow for specifying iovecs for both src and dest, assuming
preadv or pwritev was implemented either the area read from or
written to would need to be contiguous.
- Currently mem_read allows only processes who are currently
ptrace'ing the target and are still able to ptrace the target to read
from the target. This check could possibly be moved to the open call,
but its not clear exactly what race this restriction is stopping
(reason appears to have been lost)
- Having to send the fd of /proc/self/mem via SCM_RIGHTS on unix
domain socket is a bit ugly from a userspace point of view,
especially when you may have hundreds if not (eventually) thousands
of processes that all need to do this with each other
- Doesn't allow for some future use of the interface we would like to
consider adding in the future (see below)
- Interestingly reading from /proc/pid/mem currently actually
involves two copies! (But this could be fixed pretty easily)
As mentioned previously use of vmsplice instead was considered, but has
problems. Since you need the reader and writer working co-operatively if
the pipe is not drained then you block. Which requires some wrapping to
do non blocking on the send side or polling on the receive. In all to all
communication it requires ordering otherwise you can deadlock. And in the
example of many MPI tasks writing to one MPI task vmsplice serialises the
copying.
There are some cases of MPI collectives where even a single copy interface
does not get us the performance gain we could. For example in an
MPI_Reduce rather than copy the data from the source we would like to
instead use it directly in a mathops (say the reduce is doing a sum) as
this would save us doing a copy. We don't need to keep a copy of the data
from the source. I haven't implemented this, but I think this interface
could in the future do all this through the use of the flags - eg could
specify the math operation and type and the kernel rather than just
copying the data would apply the specified operation between the source
and destination and store it in the destination.
Although we don't have a "second user" of the interface (though I've had
some nibbles from people who may be interested in using it for intra
process messaging which is not MPI). This interface is something which
hardware vendors are already doing for their custom drivers to implement
fast local communication. And so in addition to this being useful for
OpenMPI it would mean the driver maintainers don't have to fix things up
when the mm changes.
There was some discussion about how much faster a true zero copy would
go. Here's a link back to the email with some testing I did on that:
http://marc.info/?l=linux-mm&m=130105930902915&w=2
There is a basic man page for the proposed interface here:
http://ozlabs.org/~cyeoh/cma/process_vm_readv.txt
This has been implemented for x86 and powerpc, other architecture should
mainly (I think) just need to add syscall numbers for the process_vm_readv
and process_vm_writev. There are 32 bit compatibility versions for
64-bit kernels.
For arch maintainers there are some simple tests to be able to quickly
verify that the syscalls are working correctly here:
http://ozlabs.org/~cyeoh/cma/cma-test-20110718.tgz
Signed-off-by: Chris Yeoh <yeohc@au1.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Howells <dhowells@redhat.com>
Cc: James Morris <jmorris@namei.org>
Cc: <linux-man@vger.kernel.org>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix the wrong use of schedule_hrtimeout_range_clock() in wq_sleep(),
although it is harmless for the syscall mq_timed* now. It was introduced
by 9ca7d8e ("mqueue: Convert message queue timeout to use hrtimers").
Signed-off-by: Wanlong Gao <gaowanlong@cn.fujitsu.com>
Cc: Carsten Emde <C.Emde@osadl.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The display of the "huge" tag was accidentally removed in 29ea2f698 ("mm:
use walk_page_range() instead of custom page table walking code").
Reported-by: Stephen Hemminger <shemminger@vyatta.com>
Tested-by: Stephen Hemminger <shemminger@vyatta.com>
Reviewed-by: Stephen Wilson <wilsons@start.ca>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: <stable@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
x86_64 allnoconfig:
In file included from arch/x86/kernel/pci-dma.c:3:
include/linux/dmar.h:248: warning: 'struct acpi_dmar_header' declared inside parameter list
include/linux/dmar.h:248: warning: its scope is only this definition or declaration, which is probably not what you want
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 5fd75a7850 (dma-mapping: remove unnecessary sync_single_range_*
in dma_map_ops) unified not only the dma_map_ops but also the
corresponding debug_dma_sync_* calls. This led to spurious WARN()ings
like the following because the DMA debug code was no longer able to detect
the DMA buffer base address without the separate offset parameter:
WARNING: at lib/dma-debug.c:911 check_sync+0xce/0x446()
firewire_ohci 0000:04:00.0: DMA-API: device driver tries to sync DMA memory it has not allocated [device address=0x00000000cedaa400] [size=1024 bytes]
Call Trace: ...
[<ffffffff811326a5>] check_sync+0xce/0x446
[<ffffffff81132ad9>] debug_dma_sync_single_for_device+0x39/0x3b
[<ffffffffa01d6e6a>] ohci_queue_iso+0x4f3/0x77d [firewire_ohci]
...
To fix this, unshare the sync_single_* and sync_single_range_*
implementations so that we are able to call the correct debug_dma_sync_*
functions.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Clemens Ladisch <clemens@ladisch.de>
Cc: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The module.h cleanup series is not merged at this point, so use the
older header file for now, to make it build either way.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
* 'next' of git://git.monstr.eu/linux-2.6-microblaze:
microblaze: Remove __ARCH_WANT_INTERRUPTS_ON_CTXSW usage
microblaze: Use delay slot in __strnlen_user, __strncpy_user
microblaze: Remove NET_IP_ALIGN from system.h
microblaze: Add __ucmpdi2() helper function
microblaze: Raise SIGFPE/FPE_INTDIV for div by zero
microblaze: Switch ELF_ARCH code to 189
microblaze: Added DMA sync operations
microblaze: Moved __dma_sync() to dma-mapping.h
microblaze: Add PVR for Microblaze v8.20.a
microblaze: Fix access_ok macro
microblaze: Add loop unrolling for PAGE in copy_tofrom_user
microblaze: Simplify logic for unaligned byte copying
microblaze: Change label names - copy_tofrom_user
microblaze: Separate fixup section definition
microblaze: Change label name in copy_tofrom_user
microblaze: Clear top bit from cnt32_to_63
The exynos4 updates conflict with code from the arm devel-stable branch
and new boards need to set atag_offset in place of boot_param.
Conflicts:
arch/arm/Kconfig
arch/arm/mach-exynos4/include/mach/entry-macro.S
arch/arm/mach-exynos4/mach-smdkc210.c
arch/arm/mach-exynos4/mach-smdkv310.c
arch/arm/mach-exynos4/mct.c
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
* 'for_linus' of git://cavan.codon.org.uk/platform-drivers-x86: (45 commits)
acer-wmi: replaced the hard coded bitmap by the communication devices bitmap from SMBIOS
acer-wmi: check the existence of internal wireless device when set capability
acer-wmi: add ACER_WMID_v2 interface flag to represent new notebooks
sony-laptop:irq: Remove IRQF_DISABLED
asus-laptop: Add rfkill support for Pegatron Lucid tablet
asus-laptop: pega_accel - Report accelerometer orientation change through udev
asus-laptop: fix module description
asus-laptop: hide leds on Pegatron Lucid
asus-laptop: Pegatron Lucid accelerometer
asus-laptop: allow boot time control of Pegatron ALS sensor
Platform: samsung_laptop: add support for X520 machines.
platform: samsung_laptop: add dmi information for Samsung R700 laptops
hp_accel: Add axis-mapping for HP ProBook / EliteBook
hp_accel: Add a new PNP id
WMI: properly cleanup devices to avoid crashes
ideapad: remove sysfs node for cfg
ideapad: add debugfs support
ideapad: add event for Novo key
ideapad: change parameter of ideapad_sync_rfk_state
ideapad: define vpc commands
...
* 'v4l_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media: (348 commits)
[media] pctv452e: Remove bogus code
[media] adv7175: Make use of media bus pixel codes
[media] media: vb2: fix incorrect return value
[media] em28xx: implement VIDIOC_ENUM_FRAMESIZES
[media] cx23885: Stop the risc video fifo before reconfiguring it
[media] cx23885: Avoid incorrect error handling and reporting
[media] cx23885: Avoid stopping the risc engine during buffer timeout
[media] cx23885: Removed a spurious function cx23885_set_scale()
[media] cx23885: v4l2 api compliance, set the audioset field correctly
[media] cx23885: hook the audio selection functions into the main driver
[media] cx23885: add generic functions for dealing with audio input selection
[media] cx23885: fixes related to maximum number of inputs and range checking
[media] cx23885: Initial support for the MPX-885 mini-card
[media] cx25840: Ensure AUDIO6 and AUDIO7 trigger line-in baseband use
[media] cx23885: Enable audio line in support from the back panel
[media] cx23885: Allow the audio mux config to be specified on a per input basis
[media] cx25840: Enable support for non-tuner LR1/LR2 audio inputs
[media] cx23885: Name an internal i2c part and declare a bitfield by name
[media] cx23885: Ensure VBI buffers timeout quickly - bugfix for vbi hangs during streaming
[media] cx23885: remove channel dump diagnostics when a vbi buffer times out
...
Fix up trivial conflicts in drivers/misc/altera-stapl/altera.c (header
file rename vs add)
The timer and cleanup branches from stericsson conflict,
so I'm merging them here.
Conflicts:
arch/arm/mach-ux500/Makefile
arch/arm/mach-ux500/cpu.c
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc:
sparc64: Fix masking and shifting in VIS fpcmp emulation.
sparc32: Correct the return value of memcpy.
sparc32: Remove uses of %g7 in memcpy implementation.
sparc32: Remove non-kernel code from memcpy implementation.
Omap cleanups conflicted with omap2_dss work in a nontrivial
way, this is the most logical fixup.
Conflicts:
arch/arm/mach-omap2/board-2430sdp.c
arch/arm/mach-omap2/board-4430sdp.c
arch/arm/mach-omap2/board-apollon.c
arch/arm/mach-omap2/board-h4.c
arch/arm/mach-omap2/board-ldp.c
arch/arm/mach-omap2/board-rx51.c
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Enable the maximum size (128) supported by the device for the shadow
vlans table, ignoring the module parameter that overrides it. This
table is only used by the IBoE control plane for setting a vlan index
into an RC/UC QP context or UD Address Handle.
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
There's no need to set the vlan-related fields in an IBoE send WQE
control segment:
- the vlan to be used by a UD QP is set in the datagram segment.
- for GSI (CM) QP, all the headers down to 8021q and MAC are built by
the software anyway.
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
The IBoE port MTU is derived from the corresponding Ethernet netdevice
MTU, which can support jumbo frames of 9K, and hence surely supports
the max IB mtu of 4K.
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
QPs need to be moved to error before telling the firwmare to shutdown
the queue. Otherwise, the application can submit WRs that will never
get fetched by the hardware and never flushed by the driver.
Signed-off-by: Kumar Sanghvi <kumaras@chelsio.com>
Acked-by: Steve Wise <swsie@opengridcomputing.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Commit 01e7da6ba5 ("RDMA/cxgb4: Make sure flush CQ entries are
collected on connection close") introduced a potential problem where a
CQ's comp_handler can get called simultaneously from different places
in the iw_cxgb4 driver. This does not comply with
Documentation/infiniband/core_locking.txt, which states that at a
given point of time, there should be only one callback per CQ should
be active.
This problem was reported by Parav Pandit <Parav.Pandit@Emulex.Com>.
Based on discussion between Parav Pandit and Steve Wise, this patch
fixes the above problem by serializing the calls to a CQ's
comp_handler using a spin_lock.
Reported-by: Parav Pandit <Parav.Pandit@Emulex.Com>
Signed-off-by: Kumar Sanghvi <kumaras@chelsio.com>
Acked-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
iw_cxgb3 has a potential problem where a CQ's comp_handler can get
called simultaneously from different places in iw_cxgb3 driver. This
does not comply with Documentation/infiniband/core_locking.txt, which
states that at a given point of time, there should be only one
callback per CQ should be active.
Such problem was reported by Parav Pandit <Parav.Pandit@Emulex.Com>
for iw_cxgb4 driver. Based on discussion between Parav Pandit and
Steve Wise, this patch fixes the above problem by serializing the
calls to a CQ's comp_handler using a spin_lock.
Signed-off-by: Kumar Sanghvi <kumaras@chelsio.com>
Acked-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Fix an issue where the link would come up after replugging a cable
even if it has been DISABLED manually.
Signed-off-by: Mitko Haralanov <mitko@qlogic.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@qlogic.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Clean up: the first parameter of nfs_create_request() has been
incorrectly documented since time immemorial (OK, since before
2.6.12).
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
craa_type_mask is bitmap4 per RFC5661. We need to expect a length before
extracting bitmap value.
Cc: Alexandros Batsakis <batsakis@netapp.com>
Signed-off-by: Peng Tao <peng_tao@emc.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Current pnfs_layoutcommit_inode can not handle parallel layoutcommit.
And as Trond suggested , there is no need for client to optimize for
parallel layoutcommit. So add NFS_INO_LAYOUTCOMMITTING flag to
mark inflight layoutcommit and serialize lalyoutcommit with it.
Also mark_inode_dirty_sync if pnfs_layoutcommit_inode fails to issue
layoutcommit.
Reported-by: Vitaliy Gusev <gusev.vitaliy@nexenta.com>
Signed-off-by: Peng Tao <peng_tao@emc.com>
Signed-off-by: Jim Rees <rees@umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The patch merges the build of imx3 and imx6. The Kconfig symbol
ARCH_IMX_V6_V7 is introduced to replace ARCH_MX3 and ARCH_MX6.
Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de>
Signed-off-by: Shawn Guo <shawn.guo@linaro.org>
It adds a number of core drivers support for imx6q, including clock,
General Power Controller (gpc), Multi Mode DDR Controller(mmdc) and
System Reset Controller (src).
Signed-off-by: Ranjani Vaidyanathan <ra5478@freescale.com>
Signed-off-by: Shawn Guo <shawn.guo@linaro.org>
This is a plain translation of assembly gic irq handler to C function
for CONFIG_MULTI_IRQ_HANDLER support on imx family.
Signed-off-by: Shawn Guo <shawn.guo@linaro.org>
This adds cpu hotplug for highbank. On highbank, a core is always reset and
boots up the same path as a cold boot.
Signed-off-by: Martin Bogomolni <martin@calxeda.com>
Signed-off-by: Rob Herring <rob.herring@calxeda.com>
Reviewed-by: Jamie Iles <jamie@jamieiles.com>
Reviewed-by: Shawn Guo <shawn.guo@linaro.org>
This enables SMP support on highbank processor.
Signed-off-by: Rob Herring <rob.herring@calxeda.com>
Reviewed-by: Jamie Iles <jamie@jamieiles.com>
Reviewed-by: Shawn Guo <shawn.guo@linaro.org>