linux-next

mirror of https://github.com/edk2-porting/linux-next.git synced 2025-01-02 02:34:05 +08:00

Author	SHA1	Message	Date
Russell King	f6f14a0d71	fs/adfs: map: move map-specific sb initialisation to map.c Move map specific superblock initialisation to map.c, rather than having it spread into super.c. Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2020-01-20 20:12:40 -05:00
Russell King	792314f8b2	fs/adfs: map: use find_next_bit_le() rather than open coding it Use find_next_bit_le() to find the end of a fragment in the map rather than open-coding this functionality. Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2020-01-20 20:12:40 -05:00
Russell King	197ba3c519	fs/adfs: map: incorporate map offsets into layout lookup_zone() and scan_free_map() cope in different ways with the location of the map data within a zone: 1. lookup_zone() adds a four byte offset to the map data pointer to skip over the check and free link bytes. 2. scan_free_map() needs to use the free link pointer, which is an offset from itself, so we end up adding a 32-bit offset to the end pointer (aka mapsize) which is really confusing. Rename mapsize to endbit as this is really what it is, and incorporate the 32-bit offset into the map layout. This means that both dm_startbit and dm_endbit are now bit offsets from the start of the buffer, rather than four bytes in to the buffer. Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2020-01-20 20:12:40 -05:00
Russell King	7b19526762	fs/adfs: map: factor out map cleanup We have several places which deal with releasing the map buffers and freeing the map array. Provide a helper for this. Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2020-01-20 20:12:40 -05:00
Russell King	6092b6be30	fs/adfs: map: break up adfs_read_map() Split up adfs_read_map() into separate helpers to layout the map, read the map, and release the map buffers. Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2020-01-20 20:12:40 -05:00
Russell King	e6160e469f	fs/adfs: map: rename adfs_map_free() to adfs_map_statfs() adfs_map_free() is not obvious whether it is freeing the map or returning the number of free blocks on the filesystem. Rename it to the more generic statfs() to make it clear that it's a statistic function. Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2020-01-20 20:12:40 -05:00
Russell King	f75d398d6e	fs/adfs: map: move map reading and validation to map.c Keep all the map code together in map.c, rather than having some in super.c Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2020-01-20 20:12:40 -05:00
Russell King	81916245ce	fs/adfs: inode: fix adfs_mode2atts() Fix adfs_mode2atts() to actually update the file permissions on the media rather than using the current inode mode. Note also that directories do not have read/write permissions stored on the media. Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2020-01-20 20:12:40 -05:00
Russell King	eeeb9dd98e	fs/adfs: inode: update timestamps to centisecond precision Despite ADFS timestamps having centi-second granularity, and Linux gaining fine-grained timestamp support in v2.5.48, fs/adfs was never updated. Update fs/adfs to centi-second support, and ensure that the inode ctime always reflects what is written in underlying media. Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2020-01-20 20:12:40 -05:00
Pavel Begunkov	0463b6c58e	io_uring: use labeled array init in io_op_defs Don't rely on implicit ordering of IORING_OP_ and explicitly place them at a right place in io_op_defs. Now former comments are now a part of the code and won't ever outdate. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:04:07 -07:00
Pavel Begunkov	6b47ee6eca	io_uring: optimise sqe-to-req flags translation For each IOSQE_* flag there is a corresponding REQ_F_* flag. And there is a repetitive pattern of their translation: e.g. if (sqe->flags & SQE_FLAG) req->flags \|= REQ_F_FLAG Use same numeric values/bits for them and copy instead of manual handling. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:04:07 -07:00
Pavel Begunkov	87987898a1	io_uring: remove REQ_F_IO_DRAINED A request can get into the defer list only once, there is no need for marking it as drained, so remove it. This probably was left after extracting __need_defer() for use in timeouts. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:04:07 -07:00
Jens Axboe	e46a7950d3	io_uring: file switch work needs to get flushed on exit We currently flush early, but if we have something in progress and a new switch is scheduled, we need to ensure to flush after our teardown as well. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:04:07 -07:00
Pavel Begunkov	b14cca0c84	io_uring: hide uring_fd in ctx req->ring_fd and req->ring_file are used only during the prep stage during submission, which is is protected by mutex. There is no need to store them per-request, place them in ctx. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:04:06 -07:00
Pavel Begunkov	0791015837	io_uring: remove extra check in __io_commit_cqring __io_commit_cqring() is almost always called when there is a change in the rings, so the check is rather pessimising. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:04:06 -07:00
Pavel Begunkov	711be0312d	io_uring: optimise use of ctx->drain_next Move setting ctx->drain_next to the only place it could be set, when it got linked non-head requests. The same for checking it, it's interesting only for a head of a link or a non-linked request. No functional changes here. This removes some code from the common path and also removes REQ_F_DRAIN_LINK flag, as it doesn't need it anymore. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:04:06 -07:00
Jens Axboe	66f4af93da	io_uring: add support for probing opcodes The application currently has no way of knowing if a given opcode is supported or not without having to try and issue one and see if we get -EINVAL or not. And even this approach is fraught with peril, as maybe we're getting -EINVAL due to some fields being missing, or maybe it's just not that easy to issue that particular command without doing some other leg work in terms of setup first. This adds IORING_REGISTER_PROBE, which fills in a structure with info on what it supported or not. This will work even with sparse opcode fields, which may happen in the future or even today if someone backports specific features to older kernels. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:04:06 -07:00
Jens Axboe	10fef4bebf	io_uring: account fixed file references correctly in batch We can't assume that the whole batch has fixed files in it. If it's a mix, or none at all, then we can end up doing a ref put that either messes up accounting, or causes an oops if we have no fixed files at all. Also ensure we free requests properly between inflight accounted and normal requests. Fixes: 82c721577011 ("io_uring: extend batch freeing to cover more cases") Reported-by: Dmitrii Dolgov <9erthalion6@gmail.com> Reported-by: Pavel Begunkov <asml.silence@gmail.com> Tested-by: Dmitrii Dolgov <9erthalion6@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:04:06 -07:00
Jens Axboe	354420f705	io_uring: add opcode to issue trace event For some test apps at least, user_data is just zeroes. So it's not a good way to tell what the command actually is. Add the opcode to the issue trace point. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:04:06 -07:00
Jens Axboe	cebdb98617	io_uring: add support for IORING_OP_OPENAT2 Add support for the new openat2(2) system call. It's trivial to do, as we can have openat(2) just be wrapped around it. Suggested-by: Stefan Metzmacher <metze@samba.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:04:04 -07:00
Jens Axboe	f8748881b1	io_uring: remove 'fname' from io_open structure We only use it internally in the prep functions for both statx and openat, so we don't need it to be persistent across the request. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:04:04 -07:00
Jens Axboe	c12cedf24e	io_uring: add 'struct open_how' to the openat request context We'll need this for openat2(2) support, remove flags and mode from the existing io_open struct. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:04:04 -07:00
Jens Axboe	f2842ab5b7	io_uring: enable option to only trigger eventfd for async completions If an application is using eventfd notifications with poll to know when new SQEs can be issued, it's expecting the following read/writes to complete inline. And with that, it knows that there are events available, and don't want spurious wakeups on the eventfd for those requests. This adds IORING_REGISTER_EVENTFD_ASYNC, which works just like IORING_REGISTER_EVENTFD, except it only triggers notifications for events that happen from async completions (IRQ, or io-wq worker completions). Any completions inline from the submission itself will not trigger notifications. Suggested-by: Mark Papadakis <markuspapadakis@icloud.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:04:04 -07:00
Jens Axboe	69b3e54613	io_uring: change io_ring_ctx bool fields into bit fields In preparation for adding another one, which would make us spill into another long (and hence bump the size of the ctx), change them to bit fields. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:04:04 -07:00
Jens Axboe	c150368b49	io_uring: file set registration should use interruptible waits If an application attempts to register a set with unbounded requests pending, we can be stuck here forever if they don't complete. We can make this wait interruptible, and just abort if we get signaled. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:04:04 -07:00
YueHaibing	96fd84d83a	io_uring: Remove unnecessary null check Null check kfree is redundant, so remove it. This is detected by coccinelle. Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:04:04 -07:00
Jens Axboe	fddafacee2	io_uring: add support for send(2) and recv(2) This adds IORING_OP_SEND for send(2) support, and IORING_OP_RECV for recv(2) support. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:04:04 -07:00
Pavel Begunkov	2550878f84	io_uring: remove extra io_wq_current_is_worker() io_wq workers use io_issue_sqe() to forward sqes and never io_queue_sqe(). Remove extra check for io_wq_current_is_worker() Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:04:04 -07:00
Pavel Begunkov	caf582c652	io_uring: optimise commit_sqring() for common case It should be pretty rare to not submitting anything when there is something in the ring. No need to keep heuristics for this case. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:04:04 -07:00
Pavel Begunkov	ee7d46d9db	io_uring: optimise head checks in io_get_sqring() A user may ask to submit more than there is in the ring, and then io_uring will submit as much as it can. However, in the last iteration it will allocate an io_kiocb and immediately free it. It could do better and adjust @to_submit to what is in the ring. And since the ring's head is already checked here, there is no need to do it in the loop, spamming with smp_load_acquire()'s barriers Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:04:02 -07:00
Pavel Begunkov	9ef4f12489	io_uring: clamp to_submit in io_submit_sqes() Make io_submit_sqes() to clamp @to_submit itself. It removes duplicated code and prepares for following changes. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:04:02 -07:00
Jens Axboe	8110c1a621	io_uring: add support for IORING_SETUP_CLAMP Some applications like to start small in terms of ring size, and then ramp up as needed. This is a bit tricky to do currently, since we don't advertise the max ring size. This adds IORING_SETUP_CLAMP. If set, and the values for SQ or CQ ring size exceed what we support, then clamp them at the max values instead of returning -EINVAL. Since we return the chosen ring sizes after setup, no further changes are needed on the application side. io_uring already changes the ring sizes if the application doesn't ask for power-of-two sizes, for example. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:04:02 -07:00
Jens Axboe	c6ca97b30c	io_uring: extend batch freeing to cover more cases Currently we only batch free if fixed files are used, no links, no aux data, etc. This extends the batch freeing to only exclude the linked case and fallback case, and make io_free_req_many() handle the other cases just fine. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:04:02 -07:00
Jens Axboe	8237e04598	io_uring: wrap multi-req freeing in struct req_batch This cleans up the code a bit, and it allows us to build on top of the multi-req freeing. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:04:02 -07:00
Pavel Begunkov	2b85edfc0c	io_uring: batch getting pcpu references percpu_ref_tryget() has its own overhead. Instead getting a reference for each request, grab a bunch once per io_submit_sqes(). ~5% throughput boost for a "submit and wait 128 nops" benchmark. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> __io_req_free_empty() -> __io_req_do_free() Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:04:02 -07:00
Jens Axboe	c1ca757bd6	io_uring: add IORING_OP_MADVISE This adds support for doing madvise(2) through io_uring. We assume that any operation can block, and hence punt everything async. This could be improved, but hard to make bullet proof. The async punt ensures it's safe. Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:04:02 -07:00
Jens Axboe	4840e418c2	io_uring: add IORING_OP_FADVISE This adds support for doing fadvise through io_uring. We assume that WILLNEED doesn't block, but that DONTNEED may block. Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:04:01 -07:00
Jens Axboe	ba04291eb6	io_uring: allow use of offset == -1 to mean file position This behaves like preadv2/pwritev2 with offset == -1, it'll use (and update) the current file position. This obviously comes with the caveat that if the application has multiple read/writes in flight, then the end result will not be as expected. This is similar to threads sharing a file descriptor and doing IO using the current file position. Since this feature isn't easily detectable by doing a read or write, add a feature flags, IORING_FEAT_RW_CUR_POS, to allow applications to detect presence of this feature. Reported-by: 李通洲 <carter.li@eoitek.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:03:59 -07:00
Jens Axboe	3a6820f2bb	io_uring: add non-vectored read/write commands For uses cases that don't already naturally have an iovec, it's easier (or more convenient) to just use a buffer address + length. This is particular true if the use case is from languages that want to create a memory safe abstraction on top of io_uring, and where introducing the need for the iovec may impose an ownership issue. For those cases, they currently need an indirection buffer, which means allocating data just for this purpose. Add basic read/write that don't require the iovec. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:03:59 -07:00
Jens Axboe	e94f141bd2	io_uring: improve poll completion performance For busy IORING_OP_POLL_ADD workloads, we can have enough contention on the completion lock that we fail the inline completion path quite often as we fail the trylock on that lock. Add a list for deferred completions that we can use in that case. This helps reduce the number of async offloads we have to do, as if we get multiple completions in a row, we'll piggy back on to the poll_llist instead of having to queue our own offload. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:03:59 -07:00
Jens Axboe	ad3eb2c89f	io_uring: split overflow state into SQ and CQ side We currently check ->cq_overflow_list from both SQ and CQ context, which causes some bouncing of that cache line. Add separate bits of state for this instead, so that the SQ side can check using its own state, and likewise for the CQ side. This adds ->sq_check_overflow with the SQ state, and ->cq_check_overflow with the CQ state. If we hit an overflow condition, both of these bits are set. Likewise for overflow flush clear, we clear both bits. For the fast path of just checking if there's an overflow condition on either the SQ or CQ side, we can use our own private bit for this. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:03:59 -07:00
Jens Axboe	d3656344fe	io_uring: add lookup table for various opcode needs We currently have various switch statements that check if an opcode needs a file, mm, etc. These are hard to keep in sync as opcodes are added. Add a struct io_op_def that holds all of this information, so we have just one spot to update when opcodes are added. This also enables us to NOT allocate req->io if a deferred command doesn't need it, and corrects some mistakes we had in terms of what commands need mm context. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:03:59 -07:00
Jens Axboe	add7b6b85a	io_uring: remove two unnecessary function declarations __io_free_req() and io_double_put_req() aren't used before they are defined, so we can kill these two forwards. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:03:59 -07:00
Pavel Begunkov	32fe525b6d	io_uring: move *queue_link_head() from common path Move io_queue_link_head() to links handling code in io_submit_sqe(), so it wouldn't need extra checks and would have better data locality. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:03:59 -07:00
Pavel Begunkov	9d76377f7e	io_uring: rename prev to head Calling "prev" a head of a link is a bit misleading. Rename it Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:03:59 -07:00
Jens Axboe	ce35a47a3a	io_uring: add IOSQE_ASYNC io_uring defaults to always doing inline submissions, if at all possible. But for larger copies, even if the data is fully cached, that can take a long time. Add an IOSQE_ASYNC flag that the application can set on the SQE - if set, it'll ensure that we always go async for those kinds of requests. Use the io-wq IO_WQ_WORK_CONCURRENT flag to ensure we get the concurrency we desire for this case. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:03:59 -07:00
Jens Axboe	895e2ca0f6	io-wq: support concurrent non-blocking work io-wq assumes that work will complete fast (and not block), so it doesn't create a new worker when work is enqueued, if we already have at least one worker running. This is done on the assumption that if work is running, then it will complete fast. Add an option to force io-wq to fork a new worker for work queued. This is signaled by setting IO_WQ_WORK_CONCURRENT on the work item. For that case, io-wq will create a new worker, even though workers are already running. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:03:59 -07:00
Jens Axboe	eddc7ef52a	io_uring: add support for IORING_OP_STATX This provides support for async statx(2) through io_uring. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:03:54 -07:00
Jens Axboe	3934e36f60	fs: make two stat prep helpers available To implement an async stat, we need to provide the flags mapping and the statx user copy. Make them available internally, through fs/internal.h. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:03:54 -07:00
Jens Axboe	05f3fb3c53	io_uring: avoid ring quiesce for fixed file set unregister and update We currently fully quiesce the ring before an unregister or update of the fixed fileset. This is very expensive, and we can be a bit smarter about this. Add a percpu refcount for the file tables as a whole. Grab a percpu ref when we use a registered file, and put it on completion. This is cheap to do. Upon removal of a file from a set, switch the ref count to atomic mode. When we hit zero ref on the completion side, then we know we can drop the previously registered files. When the old files have been dropped, switch the ref back to percpu mode for normal operation. Since there's a period between doing the update and the kernel being done with it, add a IORING_OP_FILES_UPDATE opcode that can perform the same action. The application knows the update has completed when it gets the CQE for it. Between doing the update and receiving this completion, the application must continue to use the unregistered fd if submitting IO on this particular file. This takes the runtime of test/file-register from liburing from 14s to about 0.7s. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:03:50 -07:00
Jens Axboe	b5dba59e0c	io_uring: add support for IORING_OP_CLOSE This works just like close(2), unsurprisingly. We remove the file descriptor and post the completion inline, then offload the actual (potential) last file put to async context. Mark the async part of this work as uncancellable, as we really must guarantee that the latter part of the close is run. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:01:53 -07:00
Jens Axboe	0c9d5ccd26	io-wq: add support for uncancellable work Not all work can be cancelled, some of it we may need to guarantee that it runs to completion. Allow the caller to set IO_WQ_WORK_NO_CANCEL on work that must not be cancelled. Note that the caller work function must also check for IO_WQ_WORK_NO_CANCEL on work that is marked IO_WQ_WORK_CANCEL. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:01:53 -07:00
Jens Axboe	6e802a4ba0	fs: move filp_close() outside of __close_fd_get_file() Just one caller of this, and just use filp_close() there manually. This is important to allow async close/removal of the fd. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:01:53 -07:00
Jens Axboe	15b71abe7b	io_uring: add support for IORING_OP_OPENAT This works just like openat(2), except it can be performed async. For the normal case of a non-blocking path lookup this will complete inline. If we have to do IO to perform the open, it'll be done from async context. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:01:53 -07:00
Jens Axboe	35cb6d54c1	fs: make build_open_flags() available internally This is a prep patch for supporting non-blocking open from io_uring. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:01:53 -07:00
Jens Axboe	d63d1b5edb	io_uring: add support for fallocate() This exposes fallocate(2) through io_uring. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:01:53 -07:00
Jens Axboe	4d92748373	Merge branch 'io_uring-5.5' into for-5.6/io_uring-vfs Pull in compatability fix for the files_update command. * io_uring-5.5: io_uring: fix compat for IORING_REGISTER_FILES_UPDATE	2020-01-20 17:01:17 -07:00
Eugene Syromiatnikov	1292e972ff	io_uring: fix compat for IORING_REGISTER_FILES_UPDATE fds field of struct io_uring_files_update is problematic with regards to compat user space, as pointer size is different in 32-bit, 32-on-64-bit, and 64-bit user space. In order to avoid custom handling of compat in the syscall implementation, make fds __u64 and use u64_to_user_ptr in order to retrieve it. Also, align the field naturally and check that no garbage is passed there. Fixes: `c3a31e6056` ("io_uring: add support for IORING_REGISTER_FILES_UPDATE") Signed-off-by: Eugene Syromiatnikov <esyr@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:00:44 -07:00
Linus Torvalds	d96d875ef5	\n -----BEGIN PGP SIGNATURE----- iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAl4laHsACgkQnJ2qBz9k QNnyrQf/W6+QU/BTP0K478PvXOzglNz/UdJf8Y5bzr11Cpx9oV4Mh5MdePViyo+Q +pqVmNmNsSpFoTt4K+b/TkU8z81jB+uYnxYZgFUUVrpKh1913EriGjna0F94ZlvL b607o6cfq79J6w8Ddf64Yq415328syxVsmZiK03T04ENHHcSW1zuBiuG2iO1l4lt SM+QNfSgJe33+fRcbI9Rr7Pywhm1FYZ6EIrymTeWZTGDuU8tN0o3m5vJ9Y0AuHsf u8V/3TX2bWI/TVFWtFzOQvhq2cCVATmBesgRzaPO7brNMvyGjAvtg/gGSfnPaWPs ZOSuUuIp2aL5Z4I5ZAwca0lHErkV8w== =b/h6 -----END PGP SIGNATURE----- Merge tag 'fixes_for_v5.5-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull reiserfs fix from Jan Kara: "A fixup of a recently merged reiserfs fix which has caused problem when xattrs were not compiled in" * tag 'fixes_for_v5.5-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: reiserfs: fix handling of -EOPNOTSUPP in reiserfs_for_each_xattr	2020-01-20 11:24:13 -08:00
Eric Biggers	50d9fad73a	ubifs: use IS_ENCRYPTED() instead of ubifs_crypt_is_encrypted() There's no need for the ubifs_crypt_is_encrypted() function anymore. Just use IS_ENCRYPTED() instead, like ext4 and f2fs do. IS_ENCRYPTED() checks the VFS-level flag instead of the UBIFS-specific flag, but it shouldn't change any behavior since the flags are kept in sync. Link: https://lore.kernel.org/r/20191209212721.244396-1-ebiggers@kernel.org Acked-by: Richard Weinberger <richard@nod.at> Signed-off-by: Eric Biggers <ebiggers@google.com>	2020-01-20 10:43:46 -08:00
Anand Jain	a69976bc69	btrfs: device stats, log when stats are zeroed We had a report indicating that some read errors aren't reported by the device stats in the userland. It is important to have the errors reported in the device stat as user land scripts might depend on it to take the reasonable corrective actions. But to debug these issue we need to be really sure that request to reset the device stat did not come from the userland itself. So log an info message when device error reset happens. For example: BTRFS info (device sdc): device stats zeroed by btrfs(9223) Reported-by: philip@philip-seeger.de Link: https://www.spinics.net/lists/linux-btrfs/msg96528.html Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-01-20 16:41:02 +01:00
Josef Bacik	556755a8a9	btrfs: fix improper setting of scanned for range cyclic write cache pages We noticed that we were having regular CG OOM kills in cases where there was still enough dirty pages to avoid OOM'ing. It turned out there's this corner case in btrfs's handling of range_cyclic where files that were being redirtied were not getting fully written out because of how we do range_cyclic writeback. We unconditionally were setting scanned = 1; the first time we found any pages in the inode. This isn't actually what we want, we want it to be set if we've scanned the entire file. For range_cyclic we could be starting in the middle or towards the end of the file, so we could write one page and then not write any of the other dirty pages in the file because we set scanned = 1. Fix this by not setting scanned = 1 if we find pages. The rules for setting scanned should be 1) !range_cyclic. In this case we have a specified range to write out. 2) range_cyclic && index == 0. In this case we've started at the beginning and there is no need to loop around a second time. 3) range_cyclic && we started at index > 0 and we've reached the end of the file without satisfying our nr_to_write. This patch fixes both of our writepages implementations to make sure these rules hold true. This fixed our over zealous CG OOMs in production. Fixes: `d1310b2e0c` ("Btrfs: Split the extent_map code into two parts") Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ add comment ] Signed-off-by: David Sterba <dsterba@suse.com>	2020-01-20 16:41:02 +01:00
David Sterba	4babad1019	btrfs: safely advance counter when looking up bio csums Dan's smatch tool reports fs/btrfs/file-item.c:295 btrfs_lookup_bio_sums() warn: should this be 'count == -1' which points to the while (count--) loop. With count == 0 the check itself could decrement it to -1. There's a WARN_ON a few lines below that has never been seen in practice though. It turns out that the value of page_bytes_left matches the count (by sectorsize multiples). The loop never reaches the state where count would go to -1, because page_bytes_left == 0 is found first and this breaks out. For clarity, use only plain check on count (and only for positive value), decrement safely inside the loop. Any other discrepancy after the whole bio list processing should be reported by the exising WARN_ON_ONCE as well. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-01-20 16:41:01 +01:00
David Sterba	94f8c46566	btrfs: remove unused member btrfs_device::work This is a leftover from recently removed bio scheduling framework. Fixes: `ba8a9d0795` ("Btrfs: delete the entire async bio submission framework") Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-01-20 16:41:01 +01:00
Johannes Thumshirn	ef0a82da81	btrfs: remove unnecessary wrapper get_alloc_profile btrfs_get_alloc_profile() is a simple wrapper over get_alloc_profile(). The only difference is btrfs_get_alloc_profile() is visible to other functions in btrfs while get_alloc_profile() is static and thus only visible to functions in block-group.c. Let's just fold get_alloc_profile() into btrfs_get_alloc_profile() to get rid of the unnecessary second function. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Johannes Thumshirn <jth@kernel.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-01-20 16:41:01 +01:00
Dennis Zhou	81b29a3bf7	btrfs: add correction to handle -1 edge case in async discard From Dave's testing described below, it's possible to drive a file system to have bogus values of discardable_extents and _bytes. As btrfs_discard_calc_delay() is the only user of discardable_extents, we can correct here for any negative discardable_extents/discardable_bytes. The problem is not reliably reproducible. The workload that created it was based on linux git tree, switching between release tags, then everytihng deleted followed by a full rebalance. At this state the values of discardable_bytes was 16K and discardable_extents was -1, expected values 0 and 0. Repeating the workload again did not correct the bogus values so the offset seems to be stable once it happens. Reported-by: David Sterba <dsterba@suse.com> Signed-off-by: Dennis Zhou <dennis@kernel.org> Signed-off-by: David Sterba <dsterba@suse.com>	2020-01-20 16:41:01 +01:00
Dennis Zhou	27f0afc737	btrfs: ensure removal of discardable_* in free_bitmap() Most callers of free_bitmap() only call it if bitmap_info->bytes is 0. However, there are certain cases where we may free the free space cache via __btrfs_remove_free_space_cache(). This exposes a path where free_bitmap() is called regardless. This may result in a bad accounting situation for discardable_bytes and discardable_extents. So, remove the stats and call btrfs_discard_update_discardable(). Signed-off-by: Dennis Zhou <dennis@kernel.org> Signed-off-by: David Sterba <dsterba@suse.com>	2020-01-20 16:41:01 +01:00
Dennis Zhou	f9bb615af2	btrfs: make smaller extents more likely to go into bitmaps It's less than ideal for small extents to eat into our extent budget, so force extents <= 32KB into the bitmaps save for the first handful. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Dennis Zhou <dennis@kernel.org> Signed-off-by: David Sterba <dsterba@suse.com>	2020-01-20 16:41:00 +01:00
Dennis Zhou	5d90c5c757	btrfs: increase the metadata allowance for the free_space_cache Currently, there is no way for the free space cache to recover from being serviced by purely bitmaps because the extent threshold is set to 0 in recalculate_thresholds() when we surpass the metadata allowance. This adds a recovery mechanism by keeping large extents out of the bitmaps and increases the metadata upper bound to 64KB. The recovery mechanism bypasses this upper bound, thus making it a soft upper bound. But, with the bypass being 1MB or greater, it shouldn't add unbounded overhead. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Dennis Zhou <dennis@kernel.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-01-20 16:41:00 +01:00
Dennis Zhou	dbc2a8c927	btrfs: add async discard implementation overview Give a brief overview for how async discard is implemented. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Dennis Zhou <dennis@kernel.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-01-20 16:41:00 +01:00
Dennis Zhou	9ddf648f9c	btrfs: keep track of discard reuse stats Keep track of how much we are discarding and how often we are reusing with async discard. The discard_*_bytes values don't need any special protection because the work item provides the single threaded access. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Dennis Zhou <dennis@kernel.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-01-20 16:41:00 +01:00
Dennis Zhou	5cb0724e1b	btrfs: only keep track of data extents for async discard As mentioned earlier, discarding data can be done either by issuing an explicit discard or implicitly by reusing the LBA. Metadata block_groups see much more frequent reuse due to well it being metadata. So instead of explicitly discarding metadata block_groups, just leave them be and let the latter implicit discarding be done for them. For mixed block_groups, block_groups which contain both metadata and data, we let them be as higher fragmentation is expected. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Dennis Zhou <dennis@kernel.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-01-20 16:41:00 +01:00
Dennis Zhou	7fe6d45e40	btrfs: have multiple discard lists Non-block group destruction discarding currently only had a single list with no minimum discard length. This can lead to caravaning more meaningful discards behind a heavily fragmented block group. This adds support for multiple lists with minimum discard lengths to prevent the caravan effect. We promote block groups back up when we exceed the BTRFS_ASYNC_DISCARD_MAX_FILTER size, currently we support only 2 lists with filters of 1MB and 32KB respectively. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Dennis Zhou <dennis@kernel.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-01-20 16:41:00 +01:00
Dennis Zhou	19b2a2c719	btrfs: make max async discard size tunable Expose max_discard_size as a tunable via sysfs and switch the current fixed maximum to the default value. Signed-off-by: Dennis Zhou <dennis@kernel.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-01-20 16:40:59 +01:00
Dennis Zhou	4aa9ad5203	btrfs: limit max discard size for async discard Throttle the maximum size of a discard so that we can provide an upper bound for the rate of async discard. While the block layer is able to split discards into the appropriate sized discards, we want to be able to account more accurately the rate at which we are consuming NCQ slots as well as limit the upper bound of work for a discard. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Dennis Zhou <dennis@kernel.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-01-20 16:40:59 +01:00
Dennis Zhou	e93591bb6e	btrfs: add kbps discard rate limit for async discard Provide the ability to rate limit based on kbps in addition to iops as additional guides for the target discard rate. The delay used ends up being max(kbps_delay, iops_delay). Signed-off-by: Dennis Zhou <dennis@kernel.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-01-20 16:40:59 +01:00
Dennis Zhou	a230930084	btrfs: calculate discard delay based on number of extents An earlier patch keeps track of discardable_extents. These are undiscarded extents managed by the free space cache. Here, we will use this to dynamically calculate the discard delay interval. There are 3 rate to consider. The first is the target convergence rate, the rate to discard all discardable_extents over the BTRFS_DISCARD_TARGET_MSEC time frame. This is clamped by the lower limit, the iops limit or BTRFS_DISCARD_MIN_DELAY (1ms), and the upper limit, BTRFS_DISCARD_MAX_DELAY (1s). We reevaluate this delay every transaction commit. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Dennis Zhou <dennis@kernel.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-01-20 16:40:59 +01:00
Dennis Zhou	5dc7c10b87	btrfs: keep track of discardable_bytes for async discard Keep track of this metric so that we can understand how ahead or behind we are in discarding rate. This uses the same accounting method as discardable_extents, deltas between previous/current values and propagating them up. Signed-off-by: Dennis Zhou <dennis@kernel.org> Reviewed-by: David Sterba <dsterba@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>	2020-01-20 16:40:59 +01:00
Dennis Zhou	dfb79ddb13	btrfs: track discardable extents for async discard The number of discardable extents will serve as the rate limiting metric for how often we should discard. This keeps track of discardable extents in the free space caches by maintaining deltas and propagating them to the global count. The deltas are calculated from 2 values stored in PREV and CURR entries, then propagated up to the global discard ctl. The current counter value becomes the previous counter value after update. Signed-off-by: Dennis Zhou <dennis@kernel.org> Reviewed-by: David Sterba <dsterba@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>	2020-01-20 16:40:58 +01:00
Dennis Zhou	e4faab844a	btrfs: sysfs: add UUID/debug/discard directory Setup base sysfs directory for discard stats + tunables. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Dennis Zhou <dennis@kernel.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-01-20 16:40:58 +01:00
Dennis Zhou	93945cb43e	btrfs: sysfs: make UUID/debug have its own kobject Btrfs only allowed attributes to be exposed in debug/. Let's let other groups be created by making debug its own kobject. This also makes the per-fs debug options separate from the global features mount attributes. This seems to be needed as sysfs_create_files() requires const struct attribute * while sysfs_create_group() can take struct attribute *. This seems nicer as per file system, you'll probably use to_fs_info(). Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Dennis Zhou <dennis@kernel.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-01-20 16:40:58 +01:00
Dennis Zhou	71e8978eb4	btrfs: sysfs: add removal calls for debug/ We probably should call sysfs_remove_group() on debug/. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Dennis Zhou <dennis@kernel.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-01-20 16:40:58 +01:00
Dennis Zhou	2bee7eb8bb	btrfs: discard one region at a time in async discard The prior two patches added discarding via a background workqueue. This just piggybacked off of the fstrim code to trim the whole block at once. Well inevitably this is worse performance wise and will aggressively overtrim. But it was nice to plumb the other infrastructure to keep the patches easier to review. This adds the real goal of this series which is discarding slowly (ie. a slow long running fstrim). The discarding is split into two phases, extents and then bitmaps. The reason for this is two fold. First, the bitmap regions overlap the extent regions. Second, discarding the extents first will let the newly trimmed bitmaps have the highest chance of coalescing when being readded to the free space cache. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Dennis Zhou <dennis@kernel.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-01-20 16:40:58 +01:00
Dennis Zhou	6e80d4f8c4	btrfs: handle empty block_group removal for async discard block_group removal is a little tricky. It can race with the extent allocator, the cleaner thread, and balancing. The current path is for a block_group to be added to the unused_bgs list. Then, when the cleaner thread comes around, it starts a transaction and then proceeds with removing the block_group. Extents that are pinned are subsequently removed from the pinned trees and then eventually a discard is issued for the entire block_group. Async discard introduces another player into the game, the discard workqueue. While it has none of the racing issues, the new problem is ensuring we don't leave free space untrimmed prior to forgetting the block_group. This is handled by placing fully free block_groups on a separate discard queue. This is necessary to maintain discarding order as in the future we will slowly trim even fully free block_groups. The ordering helps us make progress on the same block_group rather than say the last fully freed block_group or needing to search through the fully freed block groups at the beginning of a list and insert after. The new order of events is a fully freed block group gets placed on the unused discard queue first. Once it's processed, it will be placed on the unusued_bgs list and then the original sequence of events will happen, just without the final whole block_group discard. The mount flags can change when processing unused_bgs, so when flipping from DISCARD to DISCARD_ASYNC, the unused_bgs must be punted to the discard_list to be trimmed. If we flip off DISCARD_ASYNC, we punt free block groups on the discard_list to the unused_bg queue which will do the final discard for us. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Dennis Zhou <dennis@kernel.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-01-20 16:40:57 +01:00
Dennis Zhou	b0643e59cf	btrfs: add the beginning of async discard, discard workqueue When discard is enabled, everytime a pinned extent is released back to the block_group's free space cache, a discard is issued for the extent. This is an overeager approach when it comes to discarding and helping the SSD maintain enough free space to prevent severe garbage collection situations. This adds the beginning of async discard. Instead of issuing a discard prior to returning it to the free space, it is just marked as untrimmed. The block_group is then added to a LRU which then feeds into a workqueue to issue discards at a much slower rate. Full discarding of unused block groups is still done and will be addressed in a future patch of the series. For now, we don't persist the discard state of extents and bitmaps. Therefore, our failure recovery mode will be to consider extents untrimmed. This lets us handle failure and unmounting as one in the same. On a number of Facebook webservers, I collected data every minute accounting the time we spent in btrfs_finish_extent_commit() (col. 1) and in btrfs_commit_transaction() (col. 2). btrfs_finish_extent_commit() is where we discard extents synchronously before returning them to the free space cache. discard=sync: p99 total per minute p99 total per minute Drive \| extent_commit() (ms) \| commit_trans() (ms) --------------------------------------------------------------- Drive A \| 434 \| 1170 Drive B \| 880 \| 2330 Drive C \| 2943 \| 3920 Drive D \| 4763 \| 5701 discard=async: p99 total per minute p99 total per minute Drive \| extent_commit() (ms) \| commit_trans() (ms) -------------------------------------------------------------- Drive A \| 134 \| 956 Drive B \| 64 \| 1972 Drive C \| 59 \| 1032 Drive D \| 62 \| 1200 While it's not great that the stats are cumulative over 1m, all of these servers are running the same workload and and the delta between the two are substantial. We are spending significantly less time in btrfs_finish_extent_commit() which is responsible for discarding. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Dennis Zhou <dennis@kernel.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-01-20 16:40:57 +01:00
Dennis Zhou	da080fe1ba	btrfs: keep track of free space bitmap trim status cleanliness There is a cap in btrfs in the amount of free extents that a block group can have. When it surpasses that threshold, future extents are placed into bitmaps. Instead of keeping track of if a certain bit is trimmed or not in a second bitmap, keep track of the relative state of the bitmap. With async discard, trimming bitmaps becomes a more frequent operation. As a trade off with simplicity, we keep track of if discarding a bitmap is in progress. If we fully scan a bitmap and trim as necessary, the bitmap is marked clean. This has some caveats as the min block size may skip over regions deemed too small. But this should be a reasonable trade off rather than keeping a second bitmap and making allocation paths more complex. The downside is we may overtrim, but ideally the min block size should prevent us from doing that too often and getting stuck trimming pathological cases. BTRFS_TRIM_STATE_TRIMMING is added to indicate a bitmap is in the process of being trimmed. If additional free space is added to that bitmap, the bit is cleared. A bitmap will be marked BTRFS_TRIM_STATE_TRIMMED if the trimming code was able to reach the end of it and the former is still set. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Dennis Zhou <dennis@kernel.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-01-20 16:40:57 +01:00
Dennis Zhou	a7ccb25585	btrfs: keep track of which extents have been discarded Async discard will use the free space cache as backing knowledge for which extents to discard. This patch plumbs knowledge about which extents need to be discarded into the free space cache from unpin_extent_range(). An untrimmed extent can merge with everything as this is a new region. Absorbing trimmed extents is a tradeoff to for greater coalescing which makes life better for find_free_extent(). Additionally, it seems the size of a trim isn't as problematic as the trim io itself. When reading in the free space cache from disk, if sync is set, mark all extents as trimmed. The current code ensures at transaction commit that all free space is trimmed when sync is set, so this reflects that. Signed-off-by: Dennis Zhou <dennis@kernel.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-01-20 16:40:57 +01:00
Dennis Zhou	46b27f5059	btrfs: rename DISCARD mount option to to DISCARD_SYNC This series introduces async discard which will use the flag DISCARD_ASYNC, so rename the original flag to DISCARD_SYNC as it is synchronously done in transaction commit. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Dennis Zhou <dennis@kernel.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-01-20 16:40:57 +01:00
Qu Wenruo	147a097cf0	btrfs: tree-checker: Verify location key for DIR_ITEM/DIR_INDEX [PROBLEM] There is a user report in the mail list, showing the following corrupted tree blocks: item 62 key (486836 DIR_ITEM 2543451757) itemoff 6273 itemsize 74 location key (4065004 INODE_ITEM 1073741824) type FILE transid 21397 data_len 0 name_len 44 name: FILENAME Note that location key, its offset should be 0 for all INODE_ITEMS. This caused failed lookup of the inode. [CAUSE] That offending value, 1073741824, is 0x40000000. So this looks like a memory bit flip. [FIX] This patch will enhance tree-checker to check location key of DIR_INDEX/DIR_ITEM/XATTR_ITEM. There are several different combinations needs to check: - item_key.type == DIR_INDEX/DIR_ITEM * location_key.type == BTRFS_INODE_ITEM_KEY This location_key should follow the check in inode_item check. * location_key.type == BTRFS_ROOT_ITEM_KEY Despite the existing check, DIR_INDEX/DIR_ITEM can only points to subvolume trees. * All other keys are not allowed. - item_key.type == XATTR_ITEM location_key should be all 0. Reported-by: Mike Gilbert <floppymaster@gmail.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-01-20 16:40:56 +01:00
Qu Wenruo	57a0e67491	btrfs: tree-checker: Refactor root key check into separate function ROOT_ITEM key check itself is not as simple as single line check, and will be reused for both ROOT_ITEM and DIR_ITEM/DIR_INDEX location key check, so refactor such check into check_root_key(). Also since we are here, fix a comment error about ROOT_ITEM offset, which is transid of snapshot creation, not some "older kernel behavior". Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-01-20 16:40:56 +01:00
Qu Wenruo	c23c77b097	btrfs: tree-checker: Refactor inode key check into seperate function Inode key check is not as easy as several lines, and it will be called in more than one location (INODE_ITEM check and DIR_ITEM/DIR_INDEX/XATTR_ITEM location key check). So here refactor such check into check_inode_key(). And add extra checks for XATTR_ITEM. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-01-20 16:40:56 +01:00
Qu Wenruo	c3053ebb0b	btrfs: tree-checker: Clean up fs_info parameter from error message wrapper The @fs_info parameter can be extracted from extent_buffer structure, and there are already some wrappers getting rid of the @fs_info parameter. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-01-20 16:40:56 +01:00
Qu Wenruo	f6d2a5c263	btrfs: tree-checker: Check leaf chunk item size Inspired by btrfs-progs github issue #208, where chunk item in chunk tree has invalid num_stripes (0). Although that can already be caught by current btrfs_check_chunk_valid(), that function doesn't really check item size as it needs to handle chunk item in super block sys_chunk_array(). This patch will add two extra checks for chunk items in chunk tree: - Basic chunk item size If the item is smaller than btrfs_chunk (which already contains one stripe), exit right now as reading num_stripes may even go beyond eb boundary. - Item size check against num_stripes If item size doesn't match with calculated chunk size, then either the item size or the num_stripes is corrupted. Error out anyway. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-01-20 16:40:56 +01:00
zhengbin	0ab575c5df	btrfs: Remove unneeded semicolon Fixes coccicheck warning: fs/btrfs/print-tree.c:320:3-4: Unneeded semicolon Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: zhengbin <zhengbin13@huawei.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-01-20 16:40:55 +01:00
Omar Sandoval	95690e58e1	btrfs: remove struct find_free_extent.ram_bytes This hasn't been used since it was first introduced in commit `b4bd745d12` ("btrfs: Introduce find_free_extent_ctl structure for later rework"). Passing that to btrfs_add_reserved_bytes in find_free_extent is not strictly necessary and using the local ram_bytes instead seems cleaner. Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-01-20 16:40:55 +01:00
Omar Sandoval	c8b04030c5	btrfs: simplify compressed/inline check in __extent_writepage_io() Commit `7087a9d8db` ("btrfs: Remove extent_io_ops::writepage_end_io_hook") left this logic in a confusing state. Simplify it. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-01-20 16:40:55 +01:00
Omar Sandoval	39b07b5d70	btrfs: drop create parameter to btrfs_get_extent() We only pass this as 1 from __extent_writepage_io(). The parameter basically means "pretend I didn't pass in a page". This is silly since we can simply not pass in the page. Get rid of the parameter from btrfs_get_extent(), and since it's used as a get_extent_t callback, remove it from get_extent_t and btree_get_extent(), neither of which need it. While we're here, let's document btrfs_get_extent(). Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-01-20 16:40:55 +01:00
Omar Sandoval	f95d713b54	btrfs: remove redundant i_size check in __extent_writepage_io() In __extent_writepage_io(), we check whether i_size <= page_offset(page). Note that if i_size < page_offset(page), then i_size >> PAGE_SHIFT < page->index. If i_size == page_offset(page), then i_size >> PAGE_SHIFT == page->index && offset_in_page(i_size) == 0. __extent_writepage() already has a check for these cases that returns without calling __extent_writepage_io(): end_index = i_size >> PAGE_SHIFT pg_offset = offset_in_page(i_size); if (page->index > end_index \|\| (page->index == end_index && !pg_offset)) { page->mapping->a_ops->invalidatepage(page, 0, PAGE_SIZE); unlock_page(page); return 0; } Get rid of the one in __extent_writepage_io(), which was obsoleted in `211c17f51f` ("Fix corners in writepage and btrfs_truncate_page"). Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-01-20 16:40:55 +01:00
Omar Sandoval	169d2c875e	btrfs: remove trivial goto label in __extent_writepage() Since `40f765805f` ("Btrfs: split up __extent_writepage to lower stack usage"), done_unlocked is simply a return 0. Get rid of it. Mid-statement block returns don seem to make the code less readable here. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-01-20 16:40:54 +01:00
Omar Sandoval	eb70d22263	btrfs: remove unnecessary pg_offset assignments in __extent_writepage() We're initializing pg_offset to 0, setting it immediately, then reassigning it to 0 again after. The former became unnecessary in `211c17f51f` ("Fix corners in writepage and btrfs_truncate_page"). The latter is a leftover that should've been removed in `40f765805f` ("Btrfs: split up __extent_writepage to lower stack usage"). Remove both. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-01-20 16:40:54 +01:00
Omar Sandoval	bffe633e00	btrfs: make btrfs_ordered_extent naming consistent with btrfs_file_extent_item ordered->start, ordered->len, and ordered->disk_len correspond to fi->disk_bytenr, fi->num_bytes, and fi->disk_num_bytes, respectively. It's confusing to translate between the two naming schemes. Since a btrfs_ordered_extent is basically a pending btrfs_file_extent_item, let's make the former use the naming from the latter. Note that I didn't touch the names in tracepoints just in case there are scripts depending on the current naming. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-01-20 16:40:54 +01:00
Omar Sandoval	313facc5bd	btrfs: remove dead snapshot-aware defrag code Snapshot-aware defrag has been disabled since commit `8101c8dbf6` ("Btrfs: disable snapshot aware defrag for now") almost 6 years ago. Let's remove the dead code. If someone is up to the task of bringing it back, they can dig it up from git. This is logically a revert of commit `38c227d87c` ("Btrfs: snapshot-aware defrag") except that now we have to clear the EXTENT_DEFRAG bit to avoid need_force_cow() returning true forever. The reasons to disable were caused by runtime problems (like long stalls or memory consumption) on heavily referenced extents (eg. thousands of snapshots). There were attempts to fix that but never finished. Current defrag breaks the extent references and some users prefer that behaviour over the one implemented by snapshot aware (ie. keeping links for defragmentation). To enable both usecases we'd need to extend defrag ioctl but let's do that properly from scratch. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> [ enhance ] Signed-off-by: David Sterba <dsterba@suse.com>	2020-01-20 16:40:54 +01:00
Omar Sandoval	db72e47f79	btrfs: get rid of at_offset parameter to btrfs_lookup_bio_sums() We can encode this in the offset parameter: -1 means use the page offsets, anything else is a valid offset. Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-01-20 16:40:54 +01:00
Omar Sandoval	e62958fce9	btrfs: get rid of trivial __btrfs_lookup_bio_sums() wrappers Currently, we have two wrappers for __btrfs_lookup_bio_sums(): btrfs_lookup_bio_sums_dio(), which is used for direct I/O, and btrfs_lookup_bio_sums(), which is used everywhere else. The only difference is that the _dio variant looks up csums starting at the given offset instead of using the page index, which isn't actually direct I/O-specific. Let's clean up the signature and return value of __btrfs_lookup_bio_sums(), rename it to btrfs_lookup_bio_sums(), and get rid of the trivial helpers. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-01-20 16:40:53 +01:00
Johannes Thumshirn	321f69f86a	btrfs: reset device back to allocation state when removing When closing a device, btrfs_close_one_device() first allocates a new device, copies the device to close's name, replaces it in the dev_list with the copy and then finally frees it. This involves two memory allocation, which can potentially fail. As this code path is tricky to unwind, the allocation failures where handled by BUG_ON()s. But this copying isn't strictly needed, all that is needed is resetting the device in question to it's state it had after the allocation. Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-01-20 16:40:53 +01:00
Johannes Thumshirn	3fff3975a7	btrfs: decrement number of open devices after closing the device not before In btrfs_close_one_device we're decrementing the number of open devices before we're calling btrfs_close_bdev(). As there is no intermediate exit between these points in this function it is technically OK to do so, but it makes the code a bit harder to understand. Move both operations closer together and move the decrement step after btrfs_close_bdev(). Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-01-20 16:40:53 +01:00
Omar Sandoval	6bb6b51447	btrfs: use simple_dir_inode_operations for placeholder subvolume directory When you snapshot a subvolume containing a subvolume, you get a placeholder directory where the subvolume would be. These directories have their own btrfs_dir_ro_inode_operations. Al pointed out [1] that these directories can use simple_lookup() instead of btrfs_lookup(), as they are always empty. Furthermore, they can use the default generic_permission() instead of btrfs_permission(); the additional checks in the latter don't matter because we can't write to the directory anyways. Finally, they can use the default generic_update_time() instead of btrfs_update_time(), as the inode doesn't exist on disk and doesn't need any special handling. All together, this means that we can get rid of btrfs_dir_ro_inode_operations and use simple_dir_inode_operations instead. 1: https://lore.kernel.org/linux-btrfs/20190929052934.GY26530@ZenIV.linux.org.uk/ Cc: Al Viro <viro@zeniv.linux.org.uk> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> [ add comment ] Signed-off-by: David Sterba <dsterba@suse.com>	2020-01-20 16:40:53 +01:00
Johannes Thumshirn	b38f4cbd65	btrfs: remove impossible WARN_ON in btrfs_destroy_dev_replace_tgtdev() We have a user report, that cppcheck is complaining about a possible NULL-pointer dereference in btrfs_destroy_dev_replace_tgtdev(). We're first dereferencing the 'tgtdev' variable and the later check for the validity of the pointer with a WARN_ON(!tgtdev); But all callers of btrfs_destroy_dev_replace_tgtdev() either explicitly check if 'tgtdev' is non-NULL or directly allocate 'tgtdev', so the WARN_ON() is impossible to hit. Just remove it to silence the checker's complains. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=205003 Signed-off-by: Johannes Thumshirn <jth@kernel.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-01-20 16:40:53 +01:00
Johannes Thumshirn	1296995225	btrfs: remove superfluous BUG_ON() in integrity checks btrfsic_process_superblock() BUG_ON()s if 'state' is NULL. But this can never happen as the only caller from btrfsic_process_superblock() is btrfsic_mount() which allocates 'state' some lines above calling btrfsic_process_superblock() and checks for the allocation to succeed. Let's just remove the impossible to hit BUG_ON(). Signed-off-by: Johannes Thumshirn <jth@kernel.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-01-20 16:40:52 +01:00
Johannes Thumshirn	3dbd351df4	btrfs: fix possible NULL-pointer dereference in integrity checks A user reports a possible NULL-pointer dereference in btrfsic_process_superblock(). We are assigning state->fs_info to a local fs_info variable and afterwards checking for the presence of state. While we would BUG_ON() a NULL state anyways, we can also just remove the local fs_info copy, as fs_info is only used once as the first argument for btrfs_num_copies(). There we can just pass in state->fs_info as well. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=205003 Signed-off-by: Johannes Thumshirn <jth@kernel.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-01-20 16:40:52 +01:00
Josef Bacik	f893556637	btrfs: kill min_allocable_bytes in inc_block_group_ro This is a relic from a time before we had a proper reservation mechanism and you could end up with really full chunks at chunk allocation time. This doesn't make sense anymore, so just kill it. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-01-20 16:40:52 +01:00
Josef Bacik	9f246926b4	btrfs: don't pass system_chunk into can_overcommit We have the space_info, we can just check its flags to see if it's the system chunk space info. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-01-20 16:40:52 +01:00
Nikolay Borisov	511a32b549	btrfs: Opencode ordered_data_tree_panic It's a simple wrapper over btrfs_panic and is called only once. Just open code it. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-01-20 16:40:52 +01:00
Qu Wenruo	430640e316	btrfs: relocation: Output current relocation stage at btrfs_relocate_block_group() There are two relocation stages but both print the same message. Add the description of the stage. This can help debugging or provides informative message to users. BTRFS info (device dm-5): balance: start -d -m -s BTRFS info (device dm-5): relocating block group 30408704 flags metadata\|dup BTRFS info (device dm-5): found 2 extents, stage: move data extents BTRFS info (device dm-5): relocating block group 22020096 flags system\|dup BTRFS info (device dm-5): found 1 extents, stage: move data extents BTRFS info (device dm-5): relocating block group 13631488 flags data BTRFS info (device dm-5): found 1 extents, stage: move data extents BTRFS info (device dm-5): found 1 extents, stage: update data pointers BTRFS info (device dm-5): balance: ended with status: 0 Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-01-20 16:40:51 +01:00
Yunfeng Ye	76de60ed04	btrfs: remove unused condition check in btrfs_page_mkwrite() The condition '!ret2' is always true. commit `717beb96d9` ("Btrfs: fix regression in btrfs_page_mkwrite() from vm_fault_t conversion") left behind the check after moving this code out of the goto, so remove the unused condition check. Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Yunfeng Ye <yeyunfeng@huawei.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-01-20 16:40:51 +01:00
Nikolay Borisov	36ee0b44ad	btrfs: Remove redundant WARN_ON in walk_down_log_tree level <0 and level >= BTRFS_MAX_LEVEL are already performed upon extent buffer read by tree checker in btrfs_check_node. go. As far as 'level <= 0' we are guaranteed that level is '> 0' because the value of level _before_ reading 'next' is larger than 1 (otherwise we wouldn't have executed that code at all) this in turn guarantees that 'level' after btrfs_read_buffer is 'level - 1' since we verify this invariant in: btrfs_read_buffer btree_read_extent_buffer_pages btrfs_verify_level_key This guarantees that level can never be '<= 0' so the warn on is never triggered. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-01-20 16:40:51 +01:00
Nikolay Borisov	5c4b691eb8	btrfs: Remove WARN_ON in walk_log_tree The log_root passed to walk_log_tree is guaranteed to have its root_key.objectid always be BTRFS_TREE_LOG_OBJECTID. This is by merit that all log roots of an ordinary root are allocated in alloc_log_tree which hard-codes objectid to be BTRFS_TREE_LOG_OBJECTID. In case walk_log_tree is called for a log tree found by btrfs_read_fs_root in btrfs_recover_log_trees, that function already ensures found_key.objectid is BTRFS_TREE_LOG_OBJECTID. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-01-20 16:40:51 +01:00
Nikolay Borisov	a0fbf736d3	btrfs: Rename __btrfs_free_reserved_extent to btrfs_pin_reserved_extent __btrfs_free_reserved_extent now performs the actions of btrfs_free_and_pin_reserved_extent. But this name is a bit of a misnomer, since the extent is not really freed but just pinned. Reflect this in the new name. No semantics changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-01-20 16:40:51 +01:00
Nikolay Borisov	7ef54d54bf	btrfs: Open code __btrfs_free_reserved_extent in btrfs_free_reserved_extent __btrfs_free_reserved_extent performs 2 entirely different operations depending on whether its 'pin' argument is true or false. This patch lifts the 2nd case (pin is false) into it's sole caller btrfs_free_reserved_extent. No semantics changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-01-20 16:40:51 +01:00
Nikolay Borisov	4eaaec24c0	btrfs: Don't discard unwritten extents All callers of btrfs_free_reserved_extent (respectively __btrfs_free_reserved_extent with in set to 0) pass in extents which have only been reserved but not yet written to. Namely, * in cow_file_range that function is called only if create_io_em fails or btrfs_add_ordered_extent fail, both of which happen _before_ any IO is submitted to the newly reserved range * in submit_compressed_extents the code flow is similar - out_free_reserve can be called only before btrfs_submit_compressed_write which is where any writes to the range could occur * btrfs_new_extent_direct also calls btrfs_free_reserved_extent only if extent_map fails, before any IO is issued * __btrfs_prealloc_file_range also calls btrfs_free_reserved_extent in case insertion of the metadata fails * btrfs_alloc_tree_block again can only be called in case in-memory operations fail, before any IO is submitted * btrfs_finish_ordered_io - this is the only caller where discarding the extent could have a material effect, since it can be called for an extent which was partially written. With this change the submission of discards is optimised since discards are now not being created for extents which are known to not have been touched on disk. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-01-20 16:40:50 +01:00
Marcos Paulo de Souza	8a36e408d4	btrfs: qgroup: return ENOTCONN instead of EINVAL when quotas are not enabled [PROBLEM] qgroup create/remove code is currently returning EINVAL when the user tries to create a qgroup on a subvolume without quota enabled. EINVAL is already being used for too many error scenarios so that is hard to depict what is the problem. [FIX] Currently scrub and balance code return -ENOTCONN when the user tries to cancel/pause and no scrub or balance is currently running for the desired subvolume. Do the same here by returning -ENOTCONN when a user tries to create/delete/assing/list a qgroup on a subvolume without quota enabled. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-01-20 16:40:50 +01:00
Marcos Paulo de Souza	e3b0edd297	btrfs: qgroup: remove one-time use variables for quota_root checks Remove some variables that are set only to be checked later, and never used. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-01-20 16:40:50 +01:00
Anand Jain	bc036bb335	btrfs: sysfs, merge btrfs_sysfs_add devices_kobj and fsid Merge btrfs_sysfs_add_fsid() and btrfs_sysfs_add_devices_kobj() functions as these two are small and they are called one after the other. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-01-20 16:40:50 +01:00
Anand Jain	be2cf92e0a	btrfs: sysfs, rename btrfs_sysfs_add_device() btrfs_sysfs_add_device() creates the directory /sys/fs/btrfs/UUID/devices but its function name is misleading. Rename it to btrfs_sysfs_add_devices_kobj() instead. No functional changes. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-01-20 16:40:50 +01:00
Anand Jain	c6761a9ed3	btrfs: sysfs, btrfs_sysfs_add_fsid() drop unused argument parent Commit `24bd69cb` ("Btrfs: sysfs: add support to add parent for fsid") added parent argument in preparation to show the seed fsid under the sprout fsid as in the patch [1] in the mailing list. [1] Btrfs: sysfs: support seed devices in the sysfs layout But later this idea was superseded by another idea to rename the fsid as in the commit `f93c39970b` ("btrfs: factor out sysfs code for updating sprout fsid"). So we don't need parent argument anymore. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-01-20 16:40:49 +01:00
Anand Jain	b5501504cb	btrfs: sysfs, rename devices kobject holder to devices_kobj The struct member btrfs_device::device_dir_kobj holds the kobj of the sysfs directory /sys/fs/btrfs/UUID/devices, so rename it from device_dir_kobj to devices_kobj. No functional changes. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-01-20 16:40:49 +01:00
David Sterba	db26a02449	btrfs: fill ncopies for all raid table entries Make the number of copies explicit even for entries that use the default 0 value for consistency. Signed-off-by: David Sterba <dsterba@suse.com>	2020-01-20 16:40:49 +01:00
David Sterba	e4f6c6be81	btrfs: use raid_attr table in calc_stripe_length for nparity The table is already used for ncopies, replace open coding of stripes with the raid_attr value. Signed-off-by: David Sterba <dsterba@suse.com>	2020-01-20 16:40:49 +01:00
Filipe Manana	0e56315ca1	Btrfs: fix missing hole after hole punching and fsync when using NO_HOLES When using the NO_HOLES feature, if we punch a hole into a file and then fsync it, there are cases where a subsequent fsync will miss the fact that a hole was punched, resulting in the holes not existing after replaying the log tree. Essentially these cases all imply that, tree-log.c:copy_items(), is not invoked for the leafs that delimit holes, because nothing changed those leafs in the current transaction. And it's precisely copy_items() where we currenly detect and log holes, which works as long as the holes are between file extent items in the input leaf or between the beginning of input leaf and the previous leaf or between the last item in the leaf and the next leaf. First example where we miss a hole: ) The extent items of the inode span multiple leafs; ) The punched hole covers a range that affects only the extent items of the first leaf; ) The fsync operation is done in full mode (BTRFS_INODE_NEEDS_FULL_SYNC is set in the inode's runtime flags). That results in the hole not existing after replaying the log tree. For example, if the fs/subvolume tree has the following layout for a particular inode: Leaf N, generation 10: [ ... INODE_ITEM INODE_REF EXTENT_ITEM (0 64K) EXTENT_ITEM (64K 128K) ] Leaf N + 1, generation 10: [ EXTENT_ITEM (128K 64K) ... ] If at transaction 11 we punch a hole coverting the range [0, 128K[, we end up dropping the two extent items from leaf N, but we don't touch the other leaf, so we end up in the following state: Leaf N, generation 11: [ ... INODE_ITEM INODE_REF ] Leaf N + 1, generation 10: [ EXTENT_ITEM (128K 64K) ... ] A full fsync after punching the hole will only process leaf N because it was modified in the current transaction, but not leaf N + 1, since it was not modified in the current transaction (generation 10 and not 11). As a result the fsync will not log any holes, because it didn't process any leaf with extent items. Second example where we will miss a hole: ) An inode as its items spanning 5 (or more) leafs; *) A hole is punched and it covers only the extents items of the 3rd leaf. This resulsts in deleting the entire leaf and not touching any of the other leafs. So the only leaf that is modified in the current transaction, when punching the hole, is the first leaf, which contains the inode item. During the full fsync, the only leaf that is passed to copy_items() is that first leaf, and that's not enough for the hole detection code in copy_items() to determine there's a hole between the last file extent item in the 2nd leaf and the first file extent item in the 3rd leaf (which was the 4th leaf before punching the hole). Fix this by scanning all leafs and punch holes as necessary when doing a full fsync (less common than a non-full fsync) when the NO_HOLES feature is enabled. The lack of explicit file extent items to mark holes makes it necessary to scan existing extents to determine if holes exist. A test case for fstests follows soon. Fixes: `16e7549f04` ("Btrfs: incompatible format change to remove hole extents") CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-01-20 16:40:49 +01:00
Chen Yu	e79f15a459	x86/resctrl: Add task resctrl information display Monitoring tools that want to find out which resctrl control and monitor groups a task belongs to must currently read the "tasks" file in every group until they locate the process ID. Add an additional file /proc/{pid}/cpu_resctrl_groups to provide this information: 1) res: mon: resctrl is not available. 2) res:/ mon: Task is part of the root resctrl control group, and it is not associated to any monitor group. 3) res:/ mon:mon0 Task is part of the root resctrl control group and monitor group mon0. 4) res:group0 mon: Task is part of resctrl control group group0, and it is not associated to any monitor group. 5) res:group0 mon:mon1 Task is part of resctrl control group group0 and monitor group mon1. Signed-off-by: Chen Yu <yu.c.chen@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Tested-by: Jinshi Chen <jinshi.chen@intel.com> Link: https://lkml.kernel.org/r/20200115092851.14761-1-yu.c.chen@intel.com	2020-01-20 16:19:10 +01:00
Ingo Molnar	cb6c82df68	Linux 5.5-rc7 -----BEGIN PGP SIGNATURE----- iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAl4k7i8eHHRvcnZhbGRz QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGvk0IAKRenVOdiudY77SQ VZjsteyrYTTQtPPv494ToIRjR0XQ+gYp8vyWzXTUC5Nm9Y9U3VzDqUPUjWszrSXE 6mU+tzcMc9qwuUxnIFn8zfg64ygw+37sn/w3xqeH4QmF9Z5Wl3EX3SdXTs7jp3RS VxiztkUNI5ZBV2GDtla5K/9qLPqCQnUYXIiyi5lAtBtiitZDVXFp7dy7hMgEiaEO +78K5Kh3xlt5ndDsBFOlwIb2Oof3KL7bBXntdbSBc/bjol6IRvAgln48HWCv59G2 jzAp2tj2KobX9GRAEPj+v4TQZEW0SXDNDi8MgQsM+3DYVCTmANsv57CBKRuf01+F nB1kAys= =zSnJ -----END PGP SIGNATURE----- Merge tag 'v5.5-rc7' into perf/core, to pick up fixes Signed-off-by: Ingo Molnar <mingo@kernel.org>	2020-01-20 08:43:44 +01:00
Jens Axboe	fa7773deb3	Merge branch 'work.openat2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs into for-5.6/io_uring-vfs Pull in Al's openat2 branch, since we'll need that for the openat2 support. * 'work.openat2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: Documentation: path-lookup: include new LOOKUP flags selftests: add openat2(2) selftests open: introduce openat2(2) syscall namei: LOOKUP_{IN_ROOT,BENEATH}: permit limited ".." resolution namei: LOOKUP_IN_ROOT: chroot-like scoped resolution namei: LOOKUP_BENEATH: O_BENEATH-like scoped resolution namei: LOOKUP_NO_XDEV: block mountpoint crossing namei: LOOKUP_NO_MAGICLINKS: block magic-link resolution namei: LOOKUP_NO_SYMLINKS: block symlink resolution namei: allow set_root() to produce errors namei: allow nd_jump_link() to produce errors nsfs: clean-up ns_get_path() signature to return int namei: only return -ECHILD from follow_dotdot_rcu()	2020-01-19 19:47:04 -07:00
Aleksa Sarai	fddb5d430a	open: introduce openat2(2) syscall /* Background. / For a very long time, extending openat(2) with new features has been incredibly frustrating. This stems from the fact that openat(2) is possibly the most famous counter-example to the mantra "don't silently accept garbage from userspace" -- it doesn't check whether unknown flags are present[1]. This means that (generally) the addition of new flags to openat(2) has been fraught with backwards-compatibility issues (O_TMPFILE has to be defined as __O_TMPFILE\|O_DIRECTORY\|[O_RDWR or O_WRONLY] to ensure old kernels gave errors, since it's insecure to silently ignore the flag[2]). All new security-related flags therefore have a tough road to being added to openat(2). Userspace also has a hard time figuring out whether a particular flag is supported on a particular kernel. While it is now possible with contemporary kernels (thanks to [3]), older kernels will expose unknown flag bits through fcntl(F_GETFL). Giving a clear -EINVAL during openat(2) time matches modern syscall designs and is far more fool-proof. In addition, the newly-added path resolution restriction LOOKUP flags (which we would like to expose to user-space) don't feel related to the pre-existing O_ flag set -- they affect all components of path lookup. We'd therefore like to add a new flag argument. Adding a new syscall allows us to finally fix the flag-ignoring problem, and we can make it extensible enough so that we will hopefully never need an openat3(2). /* Syscall Prototype. / / * open_how is an extensible structure (similar in interface to * clone3(2) or sched_setattr(2)). The size parameter must be set to * sizeof(struct open_how), to allow for future extensions. All future * extensions will be appended to open_how, with their zero value * acting as a no-op default. / struct open_how { / ... / }; int openat2(int dfd, const char pathname, struct open_how how, size_t size); / Description. / The initial version of 'struct open_how' contains the following fields: flags Used to specify openat(2)-style flags. However, any unknown flag bits or otherwise incorrect flag combinations (like O_PATH\|O_RDWR) will result in -EINVAL. In addition, this field is 64-bits wide to allow for more O_ flags than currently permitted with openat(2). mode The file mode for O_CREAT or O_TMPFILE. Must be set to zero if flags does not contain O_CREAT or O_TMPFILE. resolve Restrict path resolution (in contrast to O_ flags they affect all path components). The current set of flags are as follows (at the moment, all of the RESOLVE_ flags are implemented as just passing the corresponding LOOKUP_ flag). RESOLVE_NO_XDEV => LOOKUP_NO_XDEV RESOLVE_NO_SYMLINKS => LOOKUP_NO_SYMLINKS RESOLVE_NO_MAGICLINKS => LOOKUP_NO_MAGICLINKS RESOLVE_BENEATH => LOOKUP_BENEATH RESOLVE_IN_ROOT => LOOKUP_IN_ROOT open_how does not contain an embedded size field, because it is of little benefit (userspace can figure out the kernel open_how size at runtime fairly easily without it). It also only contains u64s (even though ->mode arguably should be a u16) to avoid having padding fields which are never used in the future. Note that as a result of the new how->flags handling, O_PATH\|O_TMPFILE is no longer permitted for openat(2). As far as I can tell, this has always been a bug and appears to not be used by userspace (and I've not seen any problems on my machines by disallowing it). If it turns out this breaks something, we can special-case it and only permit it for openat(2) but not openat2(2). After input from Florian Weimer, the new open_how and flag definitions are inside a separate header from uapi/linux/fcntl.h, to avoid problems that glibc has with importing that header. /* Testing. / In a follow-up patch there are over 200 selftests which ensure that this syscall has the correct semantics and will correctly handle several attack scenarios. In addition, I've written a userspace library[4] which provides convenient wrappers around openat2(RESOLVE_IN_ROOT) (this is necessary because no other syscalls support RESOLVE_IN_ROOT, and thus lots of care must be taken when using RESOLVE_IN_ROOT'd file descriptors with other syscalls). During the development of this patch, I've run numerous verification tests using libpathrs (showing that the API is reasonably usable by userspace). / Future Work. */ Additional RESOLVE_ flags have been suggested during the review period. These can be easily implemented separately (such as blocking auto-mount during resolution). Furthermore, there are some other proposed changes to the openat(2) interface (the most obvious example is magic-link hardening[5]) which would be a good opportunity to add a way for userspace to restrict how O_PATH file descriptors can be re-opened. Another possible avenue of future work would be some kind of CHECK_FIELDS[6] flag which causes the kernel to indicate to userspace which openat2(2) flags and fields are supported by the current kernel (to avoid userspace having to go through several guesses to figure it out). [1]: https://lwn.net/Articles/588444/ [2]: https://lore.kernel.org/lkml/CA+55aFyyxJL1LyXZeBsf2ypriraj5ut1XkNDsunRBqgVjZU_6Q@mail.gmail.com [3]: commit `629e014bb8` ("fs: completely ignore unknown open flags") [4]: https://sourceware.org/bugzilla/show_bug.cgi?id=17523 [5]: https://lore.kernel.org/lkml/20190930183316.10190-2-cyphar@cyphar.com/ [6]: https://youtu.be/ggD-eb3yPVs Suggested-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: Aleksa Sarai <cyphar@cyphar.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2020-01-18 09:19:18 -05:00
Linus Torvalds	25e73aadf2	io_uring-5.5-2020-01-16 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl4hQEoQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpjZ0D/9X0JXHTlP8qebVw0Rjnb838TJtygwBZ8bm EsvEYP9lOJbR15V2WGWO2daNaaKMouglMQ8OWYMNGvREDtfNBxy3mE8ZVnpG385R RUweqCIK0rHpAfSRr4Nh9GwIeMyomLzOeumjVzXATsUS1o2+bCfv34pe22uikpgx njA2ab389hS2b9fMOFf78odazOMiCQSW7a2dwO1+5TNWtmYCei3SNPZuqZucvRPr 9iSnZswJZb8KqyGyuJo6dQQhvurXgAM8LRglc6KIJ1NpyJCgPzyULYEYOvLyLHLo USvvivi5xFeUQy1x7w72Xu3dQ0Jg+i9nSDiAACM+ehCdVcKC0OcbFcvPJ06iH9V3 RRdBUBJHHXSzklVHpo44iwZcmPNQNAWwM/vtlsrT9ln9fkgLeHG3zsScKOcv9fFw 9YmtmZQkw9Zst5wghiOQsLhwsUndOPLLUbtiNGmUr1eKXeRYekFpO++HI/DwkWhN rFVJiHbMxIP0k7uk54sNPoHrXthfNiiFjOf4eZDV20xwVJ0xenmYpfW8XW447r3W C2dGRtRBbm598OCV0PzXFd1vIUKAr8b8fJwS3gZzZOH0uYbYr79AOn1cs2F//0M0 MUXZo9LHfpfeGkMzimiPZj7lrZEI4LPAjYc1mnt4fXhuzPhYAinkcU3tQRY0T+ia 4YqjdDtD3Q== =MhHp -----END PGP SIGNATURE----- Merge tag 'io_uring-5.5-2020-01-16' of git://git.kernel.dk/linux-block Pull io_uring fixes form Jens Axboe: - Ensure ->result is always set when IO is retried (Bijan) - In conjunction with the above, fix a regression in polled IO issue when retried (me/Bijan) - Don't setup async context for read/write fixed, otherwise we may wrongly map the iovec on retry (me) - Cancel io-wq work if we fail getting mm reference (me) - Ensure dependent work is always initialized correctly (me) - Only allow original task to submit IO, don't allow it from a passed ring fd (me) * tag 'io_uring-5.5-2020-01-16' of git://git.kernel.dk/linux-block: io_uring: only allow submit from owning task io_uring: ensure workqueue offload grabs ring mutex for poll list io_uring: clear req->result always before issuing a read/write request io_uring: be consistent in assigning next work from handler io-wq: cancel work if we fail getting a mm reference io_uring: don't setup async context for read/write fixed	2020-01-17 11:25:45 -08:00
Linus Torvalds	effaf90137	for-5.5-rc6-tag -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl4h7wwACgkQxWXV+ddt WDtFgxAAqanZ3wqq8xNqybfWmmFNrKtdkakErRuQFWe/+kZH59HuBifTbYeVMD4Z hFgveFJpBIqo7uMUxUItSTtHcr3qx7TP9ejGYOaQO997oNPxPQXuEY8Lq5ebDBVB 89Gn+Eg/Q+uPvCJSctxx4dblSiGZKb3iOEh+lJuWJV4bj8beekcTrqsg01ZchPRO Ygk1ltW7Vpf0wVkdts4FKiKiwX02M2C9zxh9NQjpNwH1DMow4XtBPsbqHbiHzRym SoD4+0dbhfdnKkNnBTFEJBbjbZcYwM9EQnfiyVL+/hDMHX4XTetqeFN1G8usfXXX 2kxvwttPUtluJqlQXQnUU4mQEA4p5ORTgGgw1WBF3h+Aezumkql+27Bd6aiDKGZz SPc9sveft60R23TxorlrYVqfADgyZKEaZ+2wEM99Xoz4OdvP7jkqDentJW9us1Xh Xmfovq5xcRY17f9tdhiwqH5vgwxrLgmjBvTm/kcGX3ImhU8Yxk8xKw1JoV0P9cjW 7awK4l8pyPbOUhekdT8hYqWXlL/DXhAMHraV1zfBKIbu1omlGByeg23jNM2iS/0B YtRkEEen0tRHpuKLB08twTKCak94wObBamKNFE6Snt1cDudLwGDpUosazM9l4uPR 2D3SHs7UWNgtvTRCfq2LVoRMRSR2BA1b19EkylUig4ay7khmZ2k= =2e9q -----END PGP SIGNATURE----- Merge tag 'for-5.5-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "A few more fixes that have been in the works during last twp weeks. All have a user visible effect and are stable material: - scrub: properly update progress after calling cancel ioctl, calling 'resume' would start from the beginning otherwise - fix subvolume reference removal, after moving out of the original path the reference is not recognized and will lead to transaction abort - fix reloc root lifetime checks, could lead to crashes when there's subvolume cleaning running in parallel - fix memory leak when quotas get disabled in the middle of extent accounting - fix transaction abort in case of balance being started on degraded mount on eg. RAID1" * tag 'for-5.5-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: check rw_devices, not num_devices for balance Btrfs: always copy scrub arguments back to user space btrfs: relocation: fix reloc_root lifespan and access btrfs: fix memory leak in qgroup accounting btrfs: do not delete mismatched root refs btrfs: fix invalid removal of root ref btrfs: rework arguments of btrfs_unlink_subvol	2020-01-17 11:21:05 -08:00
Linus Torvalds	ab7541c3ad	fuse fixes for 5.5-rc7 -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCXiF1QwAKCRDh3BK/laaZ POf9AQCoPHnT7oH1gYUHfZAhS4cYX72+v6F75gYKUce0/jSDPQEAhbcMhoo31aO2 BGTXRkeCVtg77IhxUmhXCLoQYjpSoQc= =UOsx -----END PGP SIGNATURE----- Merge tag 'fuse-fixes-5.5-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse Pull fuse fix from Miklos Szeredi: "Fix a regression in the last release affecting the ftp module of the gvfs filesystem" * tag 'fuse-fixes-5.5-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: fuse: fix fuse_send_readpages() in the syncronous read case	2020-01-17 08:42:02 -08:00
Josef Bacik	b35cf1f0bf	btrfs: check rw_devices, not num_devices for balance The fstest btrfs/154 reports [ 8675.381709] BTRFS: Transaction aborted (error -28) [ 8675.383302] WARNING: CPU: 1 PID: 31900 at fs/btrfs/block-group.c:2038 btrfs_create_pending_block_groups+0x1e0/0x1f0 [btrfs] [ 8675.390925] CPU: 1 PID: 31900 Comm: btrfs Not tainted 5.5.0-rc6-default+ #935 [ 8675.392780] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014 [ 8675.395452] RIP: 0010:btrfs_create_pending_block_groups+0x1e0/0x1f0 [btrfs] [ 8675.402672] RSP: 0018:ffffb2090888fb00 EFLAGS: 00010286 [ 8675.404413] RAX: 0000000000000000 RBX: ffff92026dfa91c8 RCX: 0000000000000001 [ 8675.406609] RDX: 0000000000000000 RSI: ffffffff8e100899 RDI: ffffffff8e100971 [ 8675.408775] RBP: ffff920247c61660 R08: 0000000000000000 R09: 0000000000000000 [ 8675.410978] R10: 0000000000000000 R11: 0000000000000000 R12: 00000000ffffffe4 [ 8675.412647] R13: ffff92026db74000 R14: ffff920247c616b8 R15: ffff92026dfbc000 [ 8675.413994] FS: 00007fd5e57248c0(0000) GS:ffff92027d800000(0000) knlGS:0000000000000000 [ 8675.416146] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 8675.417833] CR2: 0000564aa51682d8 CR3: 000000006dcbc004 CR4: 0000000000160ee0 [ 8675.419801] Call Trace: [ 8675.420742] btrfs_start_dirty_block_groups+0x355/0x480 [btrfs] [ 8675.422600] btrfs_commit_transaction+0xc8/0xaf0 [btrfs] [ 8675.424335] reset_balance_state+0x14a/0x190 [btrfs] [ 8675.425824] btrfs_balance.cold+0xe7/0x154 [btrfs] [ 8675.427313] ? kmem_cache_alloc_trace+0x235/0x2c0 [ 8675.428663] btrfs_ioctl_balance+0x298/0x350 [btrfs] [ 8675.430285] btrfs_ioctl+0x466/0x2550 [btrfs] [ 8675.431788] ? mem_cgroup_charge_statistics+0x51/0xf0 [ 8675.433487] ? mem_cgroup_commit_charge+0x56/0x400 [ 8675.435122] ? do_raw_spin_unlock+0x4b/0xc0 [ 8675.436618] ? _raw_spin_unlock+0x1f/0x30 [ 8675.438093] ? __handle_mm_fault+0x499/0x740 [ 8675.439619] ? do_vfs_ioctl+0x56e/0x770 [ 8675.441034] do_vfs_ioctl+0x56e/0x770 [ 8675.442411] ksys_ioctl+0x3a/0x70 [ 8675.443718] ? trace_hardirqs_off_thunk+0x1a/0x1c [ 8675.445333] __x64_sys_ioctl+0x16/0x20 [ 8675.446705] do_syscall_64+0x50/0x210 [ 8675.448059] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 8675.479187] BTRFS: error (device vdb) in btrfs_create_pending_block_groups:2038: errno=-28 No space left We now use btrfs_can_overcommit() to see if we can flip a block group read only. Before this would fail because we weren't taking into account the usable un-allocated space for allocating chunks. With my patches we were allowed to do the balance, which is technically correct. The test is trying to start balance on degraded mount. So now we're trying to allocate a chunk and cannot because we want to allocate a RAID1 chunk, but there's only 1 device that's available for usage. This results in an ENOSPC. But we shouldn't even be making it this far, we don't have enough devices to restripe. The problem is we're using btrfs_num_devices(), that also includes missing devices. That's not actually what we want, we need to use rw_devices. The chunk_mutex is not needed here, rw_devices changes only in device add, remove or replace, all are excluded by EXCL_OP mechanism. Fixes: `e4d8ec0f65` ("Btrfs: implement online profile changing") CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ add stacktrace, update changelog, drop chunk_mutex ] Signed-off-by: David Sterba <dsterba@suse.com>	2020-01-17 15:40:54 +01:00
Filipe Manana	5afe6ce748	Btrfs: always copy scrub arguments back to user space If scrub returns an error we are not copying back the scrub arguments structure to user space. This prevents user space to know how much progress scrub has done if an error happened - this includes -ECANCELED which is returned when users ask for scrub to stop. A particular use case, which is used in btrfs-progs, is to resume scrub after it is canceled, in that case it relies on checking the progress from the scrub arguments structure and then use that progress in a call to resume scrub. So fix this by always copying the scrub arguments structure to user space, overwriting the value returned to user space with -EFAULT only if copying the structure failed to let user space know that either that copying did not happen, and therefore the structure is stale, or it happened partially and the structure is probably not valid and corrupt due to the partial copy. Reported-by: Graham Cobb <g.btrfs@cobb.uk.net> Link: https://lore.kernel.org/linux-btrfs/d0a97688-78be-08de-ca7d-bcb4c7fb397e@cobb.uk.net/ Fixes: `06fe39ab15` ("Btrfs: do not overwrite scrub error with fault error in scrub ioctl") CC: stable@vger.kernel.org # 5.1+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Tested-by: Graham Cobb <g.btrfs@cobb.uk.net> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-01-17 15:28:52 +01:00
Jens Axboe	44d282796f	io_uring: only allow submit from owning task If the credentials or the mm doesn't match, don't allow the task to submit anything on behalf of this ring. The task that owns the ring can pass the file descriptor to another task, but we don't want to allow that task to submit an SQE that then assumes the ring mm and creds if it needs to go async. Cc: stable@vger.kernel.org Suggested-by: Stefan Metzmacher <metze@samba.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-16 21:43:24 -07:00
Jeff Mahoney	394440d469	reiserfs: fix handling of -EOPNOTSUPP in reiserfs_for_each_xattr Commit `60e4cf67a5` (reiserfs: fix extended attributes on the root directory) introduced a regression open_xa_root started returning -EOPNOTSUPP but it was not handled properly in reiserfs_for_each_xattr. When the reiserfs module is built without CONFIG_REISERFS_FS_XATTR, deleting an inode would result in a warning and chowning an inode would also result in a warning and then fail to complete. With CONFIG_REISERFS_FS_XATTR enabled, the xattr root would always be present for read-write operations. This commit handles -EOPNOSUPP in the same way -ENODATA is handled. Fixes: `60e4cf67a5` ("reiserfs: fix extended attributes on the root directory") CC: stable@vger.kernel.org # Commit `60e4cf67a5` was picked up by stable Link: https://lore.kernel.org/r/20200115180059.6935-1-jeffm@suse.com Reported-by: Michael Brunnbauer <brunni@netestate.de> Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Jan Kara <jack@suse.cz>	2020-01-16 16:49:29 +01:00
Miklos Szeredi	7df1e988c7	fuse: fix fuse_send_readpages() in the syncronous read case Buffered read in fuse normally goes via: -> generic_file_buffered_read() -> fuse_readpages() -> fuse_send_readpages() ->fuse_simple_request() [called since v5.4] In the case of a read request, fuse_simple_request() will return a non-negative bytecount on success or a negative error value. A positive bytecount was taken to be an error and the PG_error flag set on the page. This resulted in generic_file_buffered_read() falling back to ->readpage(), which would repeat the read request and succeed. Because of the repeated read succeeding the bug was not detected with regression tests or other use cases. The FTP module in GVFS however fails the second read due to the non-seekable nature of FTP downloads. Fix by checking and ignoring positive return value from fuse_simple_request(). Reported-by: Ondrej Holy <oholy@redhat.com> Link: https://gitlab.gnome.org/GNOME/gvfs/issues/441 Fixes: `134831e36b` ("fuse: convert readpages to simple api") Cc: <stable@vger.kernel.org> # v5.4 Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-01-16 11:09:36 +01:00
Jens Axboe	11ba820bf1	io_uring: ensure workqueue offload grabs ring mutex for poll list A previous commit moved the locking for the async sqthread, but didn't take into account that the io-wq workers still need it. We can't use req->in_async for this anymore as both the sqthread and io-wq workers set it, gate the need for locking on io_wq_current_is_worker() instead. Fixes: `8a4955ff1c` ("io_uring: sqthread should grab ctx->uring_lock for submissions") Reported-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-15 21:51:17 -07:00
Bijan Mottahedeh	797f3f535d	io_uring: clear req->result always before issuing a read/write request req->result is cleared when io_issue_sqe() calls io_read/write_pre() routines. Those routines however are not called when the sqe argument is NULL, which is the case when io_issue_sqe() is called from io_wq_submit_work(). io_issue_sqe() may then examine a stale result if a polled request had previously failed with -EAGAIN: if (ctx->flags & IORING_SETUP_IOPOLL) { if (req->result == -EAGAIN) return -EAGAIN; io_iopoll_req_issued(req); } and in turn cause a subsequently completed request to be re-issued in io_wq_submit_work(). Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-15 21:36:13 -07:00
Linus Torvalds	84bf39461e	Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs fixes from Al Viro: "Fixes for mountpoint_last() bugs (by converting to use of lookup_last()) and an autofs regression fix from this cycle (caused by follow_managed() breakage introduced in barrier fixes series)" * 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: fix autofs regression caused by follow_managed() changes reimplement path_mountpoint() with less magic	2020-01-15 09:58:14 -08:00
Al Viro	508c877276	fix autofs regression caused by follow_managed() changes we need to reload ->d_flags after the call of ->d_manage() - the thing might've been called with dentry still negative and have the damn thing turned positive while we'd waited. Fixes: `d41efb522e` "fs/namei.c: pull positivity check into follow_managed()" Reported-by: Ian Kent <raven@themaw.net> Tested-by: Ian Kent <raven@themaw.net> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2020-01-15 01:36:46 -05:00
Al Viro	c64cd6e34e	reimplement path_mountpoint() with less magic ... and get rid of a bunch of bugs in it. Background: the reason for path_mountpoint() is that umount() really doesn't want attempts to revalidate the root of what it's trying to umount. The thing we want to avoid actually happen from complete_walk(); solution was to do something parallel to normal path_lookupat() and it both went overboard and got the boilerplate subtly (and not so subtly) wrong. A better solution is to do pretty much what the normal path_lookupat() does, but instead of complete_walk() do unlazy_walk(). All it takes to avoid that ->d_weak_revalidate() call... mountpoint_last() goes away, along with everything it got wrong, and so does the magic around LOOKUP_NO_REVAL. Another source of bugs is that when we traverse mounts at the final location (and we need to do that - umount . expects to get whatever's overmounting ., if any, out of the lookup) we really ought to take care of ->d_manage() - as it is, manual umount of autofs automount in progress can lead to unpleasant surprises for the daemon. Easily solved by using handle_lookup_down() instead of follow_mount(). Tested-by: Ian Kent <raven@themaw.net> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2020-01-15 01:36:06 -05:00
Jens Axboe	78912934f4	io_uring: be consistent in assigning next work from handler If we pass back dependent work in case of links, we need to always ensure that we call the link setup and work prep handler. If not, we might be missing some setup for the next work item. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-14 22:09:06 -07:00
Jens Axboe	e0bbb3461a	io-wq: cancel work if we fail getting a mm reference If we require mm and user context, mark the request for cancellation if we fail to acquire the desired mm. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-14 22:06:11 -07:00
Eric Biggers	da3a3da4e6	fs-verity: use u64_to_user_ptr() <linux/kernel.h> already provides a macro u64_to_user_ptr(). Use it instead of open-coding the two casts. No change in behavior. Link: https://lore.kernel.org/r/20191231175408.20524-1-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com>	2020-01-14 13:28:28 -08:00
Eric Biggers	439bea104c	fs-verity: use mempool for hash requests When initializing an fs-verity hash algorithm, also initialize a mempool that contains a single preallocated hash request object. Then replace the direct calls to ahash_request_alloc() and ahash_request_free() with allocating and freeing from this mempool. This eliminates the possibility of the allocation failing, which is desirable for the I/O path. This doesn't cause deadlocks because there's no case where multiple hash requests are needed at a time to make forward progress. Link: https://lore.kernel.org/r/20191231175545.20709-1-ebiggers@kernel.org Reviewed-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Eric Biggers <ebiggers@google.com>	2020-01-14 13:28:05 -08:00
Eric Biggers	fd39073dba	fs-verity: implement readahead of Merkle tree pages When fs-verity verifies data pages, currently it reads each Merkle tree page synchronously using read_mapping_page(). Therefore, when the Merkle tree pages aren't already cached, fs-verity causes an extra 4 KiB I/O request for every 512 KiB of data (assuming that the Merkle tree uses SHA-256 and 4 KiB blocks). This results in more I/O requests and performance loss than is strictly necessary. Therefore, implement readahead of the Merkle tree pages. For simplicity, we take advantage of the fact that the kernel already does readahead of the file's data, just like it does for any other file. Due to this, we don't really need a separate readahead state (struct file_ra_state) just for the Merkle tree, but rather we just need to piggy-back on the existing data readahead requests. We also only really need to bother with the first level of the Merkle tree, since the usual fan-out factor is 128, so normally over 99% of Merkle tree I/O requests are for the first level. Therefore, make fsverity_verify_bio() enable readahead of the first Merkle tree level, for up to 1/4 the number of pages in the bio, when it sees that the REQ_RAHEAD flag is set on the bio. The readahead size is then passed down to ->read_merkle_tree_page() for the filesystem to (optionally) implement if it sees that the requested page is uncached. While we're at it, also make build_merkle_tree_level() set the Merkle tree readahead size, since it's easy to do there. However, for now don't set the readahead size in fsverity_verify_page(), since currently it's only used to verify holes on ext4 and f2fs, and it would need parameters added to know how much to read ahead. This patch significantly improves fs-verity sequential read performance. Some quick benchmarks with 'cat'-ing a 250MB file after dropping caches: On an ARM64 phone (using sha256-ce): Before: 217 MB/s After: 263 MB/s (compare to sha256sum of non-verity file: 357 MB/s) In an x86_64 VM (using sha256-avx2): Before: 173 MB/s After: 215 MB/s (compare to sha256sum of non-verity file: 223 MB/s) Link: https://lore.kernel.org/r/20200106205533.137005-1-ebiggers@kernel.org Reviewed-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Eric Biggers <ebiggers@google.com>	2020-01-14 13:27:32 -08:00
Eric Biggers	c22415d333	fs-verity: implement readahead for FS_IOC_ENABLE_VERITY When it builds the first level of the Merkle tree, FS_IOC_ENABLE_VERITY sequentially reads each page of the file using read_mapping_page(). This works fine if the file's data is already in pagecache, which should normally be the case, since this ioctl is normally used immediately after writing out the file. But in any other case this implementation performs very poorly, since only one page is read at a time. Fix this by implementing readahead using the functions from mm/readahead.c. This improves performance in the uncached case by about 20x, as seen in the following benchmarks done on a 250MB file (on x86_64 with SHA-NI): FS_IOC_ENABLE_VERITY uncached (before) 3.299s FS_IOC_ENABLE_VERITY uncached (after) 0.160s FS_IOC_ENABLE_VERITY cached 0.147s sha256sum uncached 0.191s sha256sum cached 0.145s Note: we could instead switch to kernel_read(). But that would mean we'd no longer be hashing the data directly from the pagecache, which is a nice optimization of its own. And using kernel_read() would require allocating another temporary buffer, hashing the data and tree pages separately, and explicitly zero-padding the last page -- so it wouldn't really be any simpler than direct pagecache access, at least for now. Link: https://lore.kernel.org/r/20200106205410.136707-1-ebiggers@kernel.org Reviewed-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Eric Biggers <ebiggers@google.com>	2020-01-14 13:27:32 -08:00
Eric Biggers	2d8f7f119b	fscrypt: document gfp_flags for bounce page allocation Document that fscrypt_encrypt_pagecache_blocks() allocates the bounce page from a mempool, and document what this means for the @gfp_flags argument. Link: https://lore.kernel.org/r/20191231181026.47400-1-ebiggers@kernel.org Reviewed-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Eric Biggers <ebiggers@google.com>	2020-01-14 12:51:12 -08:00
Eric Biggers	796f12d742	fscrypt: optimize fscrypt_zeroout_range() Currently fscrypt_zeroout_range() issues and waits on a bio for each block it writes, which makes it very slow. Optimize it to write up to 16 pages at a time instead. Also add a function comment, and improve reliability by allowing the allocations of the bio and the first ciphertext page to wait on the corresponding mempools. Link: https://lore.kernel.org/r/20191226160813.53182-1-ebiggers@kernel.org Reviewed-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Eric Biggers <ebiggers@google.com>	2020-01-14 12:50:33 -08:00
Linus Torvalds	e033e7d4a8	Merge branch 'dhowells' (patches from DavidH) Merge misc fixes from David Howells. Two afs fixes and a key refcounting fix. * dhowells: afs: Fix afs_lookup() to not clobber the version on a new dentry afs: Fix use-after-loss-of-ref keys: Fix request_key() cache	2020-01-14 09:56:31 -08:00
David Howells	f52b83b0b1	afs: Fix afs_lookup() to not clobber the version on a new dentry Fix afs_lookup() to not clobber the version set on a new dentry by afs_do_lookup() - especially as it's using the wrong version of the version (we need to use the one given to us by whatever op the dir contents correspond to rather than what's in the afs_vnode). Fixes: `9dd0b82ef5` ("afs: Fix missing dentry data version updating") Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2020-01-14 09:40:06 -08:00
David Howells	40a708bd62	afs: Fix use-after-loss-of-ref afs_lookup() has a tracepoint to indicate the outcome of d_splice_alias(), passing it the inode to retrieve the fid from. However, the function gave up its ref on that inode when it called d_splice_alias(), which may have failed and dropped the inode. Fix this by caching the fid. Fixes: `80548b0399` ("afs: Add more tracepoints") Reported-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2020-01-14 09:40:06 -08:00
Eric Snowberg	a37f4958f7	debugfs: Return -EPERM when locked down When lockdown is enabled, debugfs_is_locked_down returns 1. It will then trigger the following: WARNING: CPU: 48 PID: 3747 CPU: 48 PID: 3743 Comm: bash Not tainted 5.4.0-1946.x86_64 #1 Hardware name: Oracle Corporation ORACLE SERVER X7-2/ASM, MB, X7-2, BIOS 41060400 05/20/2019 RIP: 0010:do_dentry_open+0x343/0x3a0 Code: 00 40 08 00 45 31 ff 48 c7 43 28 40 5b e7 89 e9 02 ff ff ff 48 8b 53 28 4c 8b 72 70 4d 85 f6 0f 84 10 fe ff ff e9 f5 fd ff ff <0f> 0b 41 bf ea ff ff ff e9 3b ff ff ff 41 bf e6 ff ff ff e9 b4 fe RSP: 0018:ffffb8740dde7ca0 EFLAGS: 00010202 RAX: ffffffff89e88a40 RBX: ffff928c8e6b6f00 RCX: 0000000000000000 RDX: 0000000000000000 RSI: ffff928dbfd97778 RDI: ffff9285cff685c0 RBP: ffffb8740dde7cc8 R08: 0000000000000821 R09: 0000000000000030 R10: 0000000000000057 R11: ffffb8740dde7a98 R12: ffff926ec781c900 R13: ffff928c8e6b6f10 R14: ffffffff8936e190 R15: 0000000000000001 FS: 00007f45f6777740(0000) GS:ffff928dbfd80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fff95e0d5d8 CR3: 0000001ece562006 CR4: 00000000007606e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 PKRU: 55555554 Call Trace: vfs_open+0x2d/0x30 path_openat+0x2d4/0x1680 ? tty_mode_ioctl+0x298/0x4c0 do_filp_open+0x93/0x100 ? strncpy_from_user+0x57/0x1b0 ? __alloc_fd+0x46/0x150 do_sys_open+0x182/0x230 __x64_sys_openat+0x20/0x30 do_syscall_64+0x60/0x1b0 entry_SYSCALL_64_after_hwframe+0x170/0x1d5 RIP: 0033:0x7f45f5e5ce02 Code: 25 00 00 41 00 3d 00 00 41 00 74 4c 48 8d 05 25 59 2d 00 8b 00 85 c0 75 6d 89 f2 b8 01 01 00 00 48 89 fe bf 9c ff ff ff 0f 05 <48> 3d 00 f0 ff ff 0f 87 a2 00 00 00 48 8b 4c 24 28 64 48 33 0c 25 RSP: 002b:00007fff95e0d2e0 EFLAGS: 00000246 ORIG_RAX: 0000000000000101 RAX: ffffffffffffffda RBX: 0000561178c069b0 RCX: 00007f45f5e5ce02 RDX: 0000000000000241 RSI: 0000561178c08800 RDI: 00000000ffffff9c RBP: 00007fff95e0d3e0 R08: 0000000000000020 R09: 0000000000000005 R10: 00000000000001b6 R11: 0000000000000246 R12: 0000000000000000 R13: 0000000000000003 R14: 0000000000000001 R15: 0000561178c08800 Change the return type to int and return -EPERM when lockdown is enabled to remove the warning above. Also rename debugfs_is_locked_down to debugfs_locked_down to make it sound less like it returns a boolean. Fixes: `5496197f9b` ("debugfs: Restrict debugfs when the kernel is locked down") Signed-off-by: Eric Snowberg <eric.snowberg@oracle.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: stable <stable@vger.kernel.org> Acked-by: James Morris <jamorris@linux.microsoft.com> Link: https://lore.kernel.org/r/20191207161603.35907-1-eric.snowberg@oracle.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2020-01-14 16:14:48 +01:00
Mateusz Nosek	5bf33f04eb	fs/kernfs/dir.c: Clean code by removing always true condition Previously there was an additional check if variable pos is not null. However, this check happens after entering while loop and only then, which can happen only if pos is not null. Therefore the additional check is redundant and can be removed. Signed-off-by: Mateusz Nosek <mateusznosek0@gmail.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20191230191628.21099-1-mateusznosek0@gmail.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2020-01-14 16:14:47 +01:00
Andrei Vagin	04a8682a71	fs/proc: Introduce /proc/pid/timens_offsets API to set time namespace offsets for children processes, i.e.: echo "$clockid $offset_sec $offset_nsec" > /proc/self/timens_offsets Co-developed-by: Dmitry Safonov <dima@arista.com> Signed-off-by: Andrei Vagin <avagin@gmail.com> Signed-off-by: Dmitry Safonov <dima@arista.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20191112012724.250792-28-dima@arista.com	2020-01-14 12:20:59 +01:00
Dmitry Safonov	0efc8bb0bb	fs/proc: Respect boottime inside time namespace for /proc/uptime Make sure that /proc/uptime is adjusted to the tasks time namespace. Co-developed-by: Andrei Vagin <avagin@openvz.org> Signed-off-by: Andrei Vagin <avagin@openvz.org> Signed-off-by: Dmitry Safonov <dima@arista.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20191112012724.250792-19-dima@arista.com	2020-01-14 12:20:56 +01:00
Andrei Vagin	6cd889d43c	timerfd: Make timerfd_settime() time namespace aware timerfd_settime() accepts an absolute value of the expiration time if TFD_TIMER_ABSTIME is specified. This value is in the task's time namespace and has to be converted to the host's time namespace. Co-developed-by: Dmitry Safonov <dima@arista.com> Signed-off-by: Andrei Vagin <avagin@gmail.com> Signed-off-by: Dmitry Safonov <dima@arista.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20191112012724.250792-14-dima@arista.com	2020-01-14 12:20:53 +01:00
Andrei Vagin	769071ac9f	ns: Introduce Time Namespace Time Namespace isolates clock values. The kernel provides access to several clocks CLOCK_REALTIME, CLOCK_MONOTONIC, CLOCK_BOOTTIME, etc. CLOCK_REALTIME System-wide clock that measures real (i.e., wall-clock) time. CLOCK_MONOTONIC Clock that cannot be set and represents monotonic time since some unspecified starting point. CLOCK_BOOTTIME Identical to CLOCK_MONOTONIC, except it also includes any time that the system is suspended. For many users, the time namespace means the ability to changes date and time in a container (CLOCK_REALTIME). Providing per namespace notions of CLOCK_REALTIME would be complex with a massive overhead, but has a dubious value. But in the context of checkpoint/restore functionality, monotonic and boottime clocks become interesting. Both clocks are monotonic with unspecified starting points. These clocks are widely used to measure time slices and set timers. After restoring or migrating processes, it has to be guaranteed that they never go backward. In an ideal case, the behavior of these clocks should be the same as for a case when a whole system is suspended. All this means that it is required to set CLOCK_MONOTONIC and CLOCK_BOOTTIME clocks, which can be achieved by adding per-namespace offsets for clocks. A time namespace is similar to a pid namespace in the way how it is created: unshare(CLONE_NEWTIME) system call creates a new time namespace, but doesn't set it to the current process. Then all children of the process will be born in the new time namespace, or a process can use the setns() system call to join a namespace. This scheme allows setting clock offsets for a namespace, before any processes appear in it. All available clone flags have been used, so CLONE_NEWTIME uses the highest bit of CSIGNAL. It means that it can be used only with the unshare() and the clone3() system calls. [ tglx: Adjusted paragraph about clone3() to reality and massaged the changelog a bit. ] Co-developed-by: Dmitry Safonov <dima@arista.com> Signed-off-by: Andrei Vagin <avagin@gmail.com> Signed-off-by: Dmitry Safonov <dima@arista.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://criu.org/Time_namespace Link: https://lists.openvz.org/pipermail/criu/2018-June/041504.html Link: https://lore.kernel.org/r/20191112012724.250792-4-dima@arista.com	2020-01-14 12:20:48 +01:00
Jens Axboe	74566df3a7	io_uring: don't setup async context for read/write fixed We don't need it, and if we have it, then the retry handler will attempt to copy the non-existent iovec with the inline iovec, with a segment count that doesn't make sense. Fixes: `f67676d160` ("io_uring: ensure async punted read/write requests copy iovec") Reported-by: Jonathan Lemon <jonathan.lemon@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-13 19:25:29 -07:00
Qu Wenruo	6282675e67	btrfs: relocation: fix reloc_root lifespan and access [BUG] There are several different KASAN reports for balance + snapshot workloads. Involved call paths include: should_ignore_root+0x54/0xb0 [btrfs] build_backref_tree+0x11af/0x2280 [btrfs] relocate_tree_blocks+0x391/0xb80 [btrfs] relocate_block_group+0x3e5/0xa00 [btrfs] btrfs_relocate_block_group+0x240/0x4d0 [btrfs] btrfs_relocate_chunk+0x53/0xf0 [btrfs] btrfs_balance+0xc91/0x1840 [btrfs] btrfs_ioctl_balance+0x416/0x4e0 [btrfs] btrfs_ioctl+0x8af/0x3e60 [btrfs] do_vfs_ioctl+0x831/0xb10 create_reloc_root+0x9f/0x460 [btrfs] btrfs_reloc_post_snapshot+0xff/0x6c0 [btrfs] create_pending_snapshot+0xa9b/0x15f0 [btrfs] create_pending_snapshots+0x111/0x140 [btrfs] btrfs_commit_transaction+0x7a6/0x1360 [btrfs] btrfs_mksubvol+0x915/0x960 [btrfs] btrfs_ioctl_snap_create_transid+0x1d5/0x1e0 [btrfs] btrfs_ioctl_snap_create_v2+0x1d3/0x270 [btrfs] btrfs_ioctl+0x241b/0x3e60 [btrfs] do_vfs_ioctl+0x831/0xb10 btrfs_reloc_pre_snapshot+0x85/0xc0 [btrfs] create_pending_snapshot+0x209/0x15f0 [btrfs] create_pending_snapshots+0x111/0x140 [btrfs] btrfs_commit_transaction+0x7a6/0x1360 [btrfs] btrfs_mksubvol+0x915/0x960 [btrfs] btrfs_ioctl_snap_create_transid+0x1d5/0x1e0 [btrfs] btrfs_ioctl_snap_create_v2+0x1d3/0x270 [btrfs] btrfs_ioctl+0x241b/0x3e60 [btrfs] do_vfs_ioctl+0x831/0xb10 [CAUSE] All these call sites are only relying on root->reloc_root, which can undergo btrfs_drop_snapshot(), and since we don't have real refcount based protection to reloc roots, we can reach already dropped reloc root, triggering KASAN. [FIX] To avoid such access to unstable root->reloc_root, we should check BTRFS_ROOT_DEAD_RELOC_TREE bit first. This patch introduces wrappers that provide the correct way to check the bit with memory barriers protection. Most callers don't distinguish merged reloc tree and no reloc tree. The only exception is should_ignore_root(), as merged reloc tree can be ignored, while no reloc tree shouldn't. [CRITICAL SECTION ANALYSIS] Although test_bit()/set_bit()/clear_bit() doesn't imply a barrier, the DEAD_RELOC_TREE bit has extra help from transaction as a higher level barrier, the lifespan of root::reloc_root and DEAD_RELOC_TREE bit are: NULL: reloc_root is NULL PTR: reloc_root is not NULL 0: DEAD_RELOC_ROOT bit not set DEAD: DEAD_RELOC_ROOT bit set (NULL, 0) Initial state __ \| /\ Section A btrfs_init_reloc_root() \/ \| __ (PTR, 0) reloc_root initialized /\ \| \| btrfs_update_reloc_root() \| Section B \| \| (PTR, DEAD) reloc_root has been merged \/ \| __ === btrfs_commit_transaction() ==================== \| /\ clean_dirty_subvols() \| \| \| Section C (NULL, DEAD) reloc_root cleanup starts \/ \| __ btrfs_drop_snapshot() /\ \| \| Section D (NULL, 0) Back to initial state \/ Every have_reloc_root() or test_bit(DEAD_RELOC_ROOT) caller holds transaction handle, so none of such caller can cross transaction boundary. In Section A, every caller just found no DEAD bit, and grab reloc_root. In the cross section A-B, caller may get no DEAD bit, but since reloc_root is still completely valid thus accessing reloc_root is completely safe. No test_bit() caller can cross the boundary of Section B and Section C. In Section C, every caller found the DEAD bit, so no one will access reloc_root. In the cross section C-D, either caller gets the DEAD bit set, avoiding access reloc_root no matter if it's safe or not. Or caller get the DEAD bit cleared, then access reloc_root, which is already NULL, nothing will be wrong. The memory write barriers are between the reloc_root updates and bit set/clear, the pairing read side is before test_bit. Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org> Fixes: `d2311e6985` ("btrfs: relocation: Delay reloc tree deletion after merge_reloc_roots") CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ barriers ] Signed-off-by: David Sterba <dsterba@suse.com>	2020-01-13 23:10:56 +01:00
Sargun Dhillon	5e876fb43d	vfs, fdtable: Add fget_task helper This introduces a function which can be used to fetch a file, given an arbitrary task. As long as the user holds a reference (refcnt) to the task_struct it is safe to call, and will either return NULL on failure, or a pointer to the file, with a refcnt. This patch is based on Oleg Nesterov's (cf. [1]) patch from September 2018. [1]: Link: https://lore.kernel.org/r/20180915160423.GA31461@redhat.com Signed-off-by: Sargun Dhillon <sargun@sargun.me> Suggested-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Link: https://lore.kernel.org/r/20200107175927.4558-2-sargun@sargun.me Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>	2020-01-13 21:48:42 +01:00
Gao Xiang	4d2024370d	erofs: fix out-of-bound read for shifted uncompressed block rq->out[1] should be valid before accessing. Otherwise, in very rare cases, out-of-bound dirty onstack rq->out[1] can equal to *in and lead to unintended memmove behavior. Link: https://lore.kernel.org/r/20200107022546.19432-1-gaoxiang25@huawei.com Fixes: `7fc45dbc93` ("staging: erofs: introduce generic decompression backend") Cc: <stable@vger.kernel.org> # 5.3+ Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Gao Xiang <gaoxiang25@huawei.com>	2020-01-11 18:29:19 +08:00
Linus Torvalds	9fb7007de8	Char/Misc patch for 5.5-rc6 Here is a single fix, for the chrdev core, for 5.5-rc6 There's been a long-standing race condition triggered by syzbot, and occasionally real people, in the chrdev open() path. Will finally took the time to track it down and fix it for real before the holidays. Here's that one patch, it's been in linux-next for a while with no reported issues and it does fix the reported problem. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> -----BEGIN PGP SIGNATURE----- iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCXhjcRA8cZ3JlZ0Brcm9h aC5jb20ACgkQMUfUDdst+ykIyQCfcrNOyyFktEj7/qiVJrMLbzVWoWYAoMHtNQcG 3IYmNNJ+eXXJEiOgeZ4J =J0bS -----END PGP SIGNATURE----- Merge tag 'char-misc-5.5-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc Pull char/misc fix from Greg KH: "Here is a single fix, for the chrdev core, for 5.5-rc6 There's been a long-standing race condition triggered by syzbot, and occasionally real people, in the chrdev open() path. Will finally took the time to track it down and fix it for real before the holidays. Here's that one patch, it's been in linux-next for a while with no reported issues and it does fix the reported problem" * tag 'char-misc-5.5-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: chardev: Avoid potential use-after-free in 'chrdev_open()'	2020-01-10 13:25:24 -08:00
Linus Torvalds	4e4cd21c64	block-5.5-2020-01-10 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl4YvdoQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpoOuD/0eZkvtim/ZyEzj081PZUkslWwjNvEZV9+o iYWMp0PBDyYgR79ca86EVTevcMiVxPnKpQl5DT+p9L1JzZ7dFc8U7fTpygjwbzx0 FdHlPQt+oN4TsGl3hpTGGnw2ArbCHnqqj31ahgo87zo0a01xv33C3QeGFyXEIYoR F0QQ5E7EAyT2umKKflX9PWnrbOQZ91p2P3m+AE0TtOXMUgTi2KJKHUFu+G5OOwZB dM41GvyZDY9WA7bUlFeOp0mRZsiGkfEsI59VP6AR8ZkxwsOeHLrVB5iBEGPiTDL2 dUwLwbGrLYFtwLEh4yd0aKt9++H2RZjJwi4ssyaDkkWCMQHECQXwd34DBmrV/qia hgh/4DV0E1X3MZFYOk44zp8kwjgpmU9MCH3dFU0bWnzm9WrvtS9uBDjgEDkn6zty xONSQeyHWVFQFwIjG260YEbuTplOTFP5rNWEf2CHWMHuk9kp8kfATWt9wlazYhtz OUELfWmkrGk8nqMN4Ee+ty582I8gxk48IGwiJHOYh1gMHHgFnJbgr97Pe2NCLLee 9elkJnUQSdXuF314uznrAf7XLiEC0hfHGnPCTD8pAMf2DYOGKNeivh2wtdhd98cu AvHew9qnI2C/oahY+DaE4wunP4VlNZ/ZeNAl0h8KG7uNBCA4uFtyBMNimnKVNblw 7KDsQ3vDIg== =1bsG -----END PGP SIGNATURE----- Merge tag 'block-5.5-2020-01-10' of git://git.kernel.dk/linux-block Pull block fixes from Jens Axboe: "A few fixes that should go into this round. This pull request contains two NVMe fixes via Keith, removal of a dead function, and a fix for the bio op for read truncates (Ming)" * tag 'block-5.5-2020-01-10' of git://git.kernel.dk/linux-block: nvmet: fix per feat data len for get_feature nvme: Translate more status codes to blk_status_t fs: move guard_bio_eod() after bio_set_op_attrs block: remove unused mp_bvec_last_segment	2020-01-10 12:05:26 -08:00
Linus Torvalds	30b6487d15	io_uring-5.5-2020-01-10 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl4YvQsQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpkf2D/9nY2CFUfVtYKv2nNpFTXSnTIT2r9WZEXfI vD75OF/w3RxRUJHyJNKw8YIP8fRDmpqzvryaPjb33o1nXwu2ioKe1Yrd63vbXd41 vr6X+jAp7iieO1PoG/QFb7p7ZBuhnFFKor/1scoD+4SQEOTtmwroeL6ECw1LdQGA A8F5nQzHVzrfhXNconPlXuXKVpoErP4AQ4ls5YmQI3AugK4XEYeRpRuGWuecdjJM In3TJ47heNPOCU1Eo7MO7lkFBR+CPi4uqsdFgsYJl7F1KiuwHL0e7qsWvmUMxz+9 Olq0hAVFAXLY6J8uEz+dBKN7Xu0zdsG2ILfOJrDUjbGQ+/aPb++08W162wxg1HlP +WKjpZtQpEQ3lUwk/e2fplSTC9bCA/90h/tmwjL3Olbl9+H5mxnovTevcxDX+L6x sBfHRlDPo8r1r71jXOjCKSYYLwEBHGmAA1DTXQt11386oIwP4RFXaregbkxRxvAu qqaChgnm9wb6wyApxYnU8r14qtRwN7lpK0STJB+krqoMywKpeJPirjoTDgMeovgf 1VYDDHz/PYkSWo+nRQWtXPK0k6sk2K2YnM5t4O0LZC2Pskc6C7V5PT7ZUyTw/Hfi 7IbWu5FCzsgCwC4UEhCcd9kRtFFQceZGW/SA1+VU98EyVUKrewXmq0NpmSiNfscV 0cdAHuVHhw== =TzX1 -----END PGP SIGNATURE----- Merge tag 'io_uring-5.5-2020-01-10' of git://git.kernel.dk/linux-block Pull io_uring fix from Jens Axboe: "Single fix for this series, fixing a regression with the short read handling. This just removes it, as it cannot safely be done for all cases" * tag 'io_uring-5.5-2020-01-10' of git://git.kernel.dk/linux-block: io_uring: remove punt of short reads to async context	2020-01-10 12:03:12 -08:00
Linus Torvalds	bef1d88263	pstore fix for rare error path - Fix label allocation lifetime/visibility to avoid further mistakes -----BEGIN PGP SIGNATURE----- Comment: Kees Cook <kees@outflux.net> iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAl4YAM8WHGtlZXNjb29r QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJl9mD/9il1pBZmLTQFVzfLeVnDemQ1t1 zhPywuR/FzE6KXX4kqaRJZ2ttLsHFaGqttbegQ5Jh8g5SdHZgMWRIz9y9Mb/re4g 5ny19VH4dvJHY6CYQxl3SlL+uSx29dNQOC4XGRtVwLrs6XeCIoWhXw1KI6ahiuVW AK1lelPlLIdVOLrvDn3qHy6iQBXgEA0PSosxUEjxC43ekoIRwKwsMKCeT3awHrOY uTamIU8vXwqCl+eFsgoQBoq8dgBvNI121DbKB2rToC2Pb3cE84Ez2Ul7hrqDQqzo 0wLPYJhvzMDqCaRwc5a42PqYFmRxSZrNHVbAnA9n/gEFPluJTklsq8fOEkXYml16 BKiC96PcKYfJZdEePbNx3tQVtXYj/mlncLKb8FxIlybyC+P92XGlHWDkrXyva8uY 6XyTyGpkEN1zwZnA+R6otvaOX8+I5XuX4vsRRlTaDd+1Y6SY7HxvMjxFuTko0yfY y/X0rfrhzvtrXUCoC+bkaA73EwaSXpie+8goFWQz9Erjpl7o0lH4lCIqm7J0hynM 0kmlusSEfqNQ8HBnbP25BVjW4Ws/ASrr5yGDNTigsaYeCdHkwYy+Ses50I9gMWuB oNz7jUblC0rPlO1geBQTFpec2u+9sS7Y+xhYhh+h8GLwCX09zOB8gUTE+GBgVhsY ptjh+yD7ybBiXrX8jQ== =biFc -----END PGP SIGNATURE----- Merge tag 'pstore-v5.5-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull pstore fix from Kees Cook: "Cengiz Can forwarded a Coverity report about more problems with a rare pstore initialization error path, so the allocation lifetime was rearranged to avoid needing to share the kfree() responsibilities between caller and callee" * tag 'pstore-v5.5-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: pstore/ram: Regularize prz label allocation lifetime	2020-01-09 21:42:05 -08:00
Alan Maguire	c475c77d5b	kunit: allow kunit tests to be loaded as a module As tests are added to kunit, it will become less feasible to execute all built tests together. By supporting modular tests we provide a simple way to do selective execution on a running system; specifying CONFIG_KUNIT=y CONFIG_KUNIT_EXAMPLE_TEST=m ...means we can simply "insmod example-test.ko" to run the tests. To achieve this we need to do the following: o export the required symbols in kunit o string-stream tests utilize non-exported symbols so for now we skip building them when CONFIG_KUNIT_TEST=m. o drivers/base/power/qos-test.c contains a few unexported interface references, namely freq_qos_read_value() and freq_constraints_init(). Both of these could be potentially defined as static inline functions in include/linux/pm_qos.h, but for now we simply avoid supporting module build for that test suite. o support a new way of declaring test suites. Because a module cannot do multiple late_initcall()s, we provide a kunit_test_suites() macro to declare multiple suites within the same module at once. o some test module names would have been too general ("test-test" and "example-test" for kunit tests, "inode-test" for ext4 tests); rename these as appropriate ("kunit-test", "kunit-example-test" and "ext4-inode-test" respectively). Also define kunit_test_suite() via kunit_test_suites() as callers in other trees may need the old definition. Co-developed-by: Knut Omang <knut.omang@oracle.com> Signed-off-by: Knut Omang <knut.omang@oracle.com> Signed-off-by: Alan Maguire <alan.maguire@oracle.com> Reviewed-by: Brendan Higgins <brendanhiggins@google.com> Acked-by: Theodore Ts'o <tytso@mit.edu> # for ext4 bits Acked-by: David Gow <davidgow@google.com> # For list-test Reported-by: kbuild test robot <lkp@intel.com> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>	2020-01-09 16:42:29 -07:00
Ming Lei	83c9c54716	fs: move guard_bio_eod() after bio_set_op_attrs Commit `85a8ce62c2` ("block: add bio_truncate to fix guard_bio_eod") adds bio_truncate() for handling bio EOD. However, bio_truncate() doesn't use the passed 'op' parameter from guard_bio_eod's callers. So bio_trunacate() may retrieve wrong 'op', and zering pages may not be done for READ bio. Fixes this issue by moving guard_bio_eod() after bio_set_op_attrs() in submit_bh_wbc() so that bio_truncate() can always retrieve correct op info. Meantime remove the 'op' parameter from guard_bio_eod() because it isn't used any more. Cc: Carlos Maiolino <cmaiolino@redhat.com> Cc: linux-fsdevel@vger.kernel.org Fixes: `85a8ce62c2` ("block: add bio_truncate to fix guard_bio_eod") Signed-off-by: Ming Lei <ming.lei@redhat.com> Fold in kerneldoc and bio_op() change. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-09 08:16:12 -07:00
Kees Cook	e163fdb3f7	pstore/ram: Regularize prz label allocation lifetime In my attempt to fix a memory leak, I introduced a double-free in the pstore error path. Instead of trying to manage the allocation lifetime between persistent_ram_new() and its callers, adjust the logic so persistent_ram_new() always takes a kstrdup() copy, and leaves the caller's allocation lifetime up to the caller. Therefore callers are _always_ responsible for freeing their label. Before, it only needed freeing when the prz itself failed to allocate, and not in any of the other prz failure cases, which callers would have no visibility into, which is the root design problem that lead to both the leak and now double-free bugs. Reported-by: Cengiz Can <cengiz@kernel.wtf> Link: https://lore.kernel.org/lkml/d4ec59002ede4aaf9928c7f7526da87c@kernel.wtf Fixes: `8df955a32a` ("pstore/ram: Fix error-path memory leak in persistent_ram_new() callers") Cc: stable@vger.kernel.org Signed-off-by: Kees Cook <keescook@chromium.org>	2020-01-08 17:05:45 -08:00
Johannes Thumshirn	26ef8493e1	btrfs: fix memory leak in qgroup accounting When running xfstests on the current btrfs I get the following splat from kmemleak: unreferenced object 0xffff88821b2404e0 (size 32): comm "kworker/u4:7", pid 26663, jiffies 4295283698 (age 8.776s) hex dump (first 32 bytes): 01 00 00 00 00 00 00 00 10 ff fd 26 82 88 ff ff ...........&.... 10 ff fd 26 82 88 ff ff 20 ff fd 26 82 88 ff ff ...&.... ..&.... backtrace: [<00000000f94fd43f>] ulist_alloc+0x25/0x60 [btrfs] [<00000000fd023d99>] btrfs_find_all_roots_safe+0x41/0x100 [btrfs] [<000000008f17bd32>] btrfs_find_all_roots+0x52/0x70 [btrfs] [<00000000b7660afb>] btrfs_qgroup_rescan_worker+0x343/0x680 [btrfs] [<0000000058e66778>] btrfs_work_helper+0xac/0x1e0 [btrfs] [<00000000f0188930>] process_one_work+0x1cf/0x350 [<00000000af5f2f8e>] worker_thread+0x28/0x3c0 [<00000000b55a1add>] kthread+0x109/0x120 [<00000000f88cbd17>] ret_from_fork+0x35/0x40 This corresponds to: (gdb) l (btrfs_find_all_roots_safe+0x41) 0x8d7e1 is in btrfs_find_all_roots_safe (fs/btrfs/backref.c:1413). 1408 1409 tmp = ulist_alloc(GFP_NOFS); 1410 if (!tmp) 1411 return -ENOMEM; 1412 roots = ulist_alloc(GFP_NOFS); 1413 if (!*roots) { 1414 ulist_free(tmp); 1415 return -ENOMEM; 1416 } 1417 Following the lifetime of the allocated 'roots' ulist, it gets freed again in btrfs_qgroup_account_extent(). But this does not happen if the function is called with the 'BTRFS_FS_QUOTA_ENABLED' flag cleared, then btrfs_qgroup_account_extent() does a short leave and directly returns. Instead of directly returning we should jump to the 'out_free' in order to free all resources as expected. CC: stable@vger.kernel.org # 4.14+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> [ add comment ] Signed-off-by: David Sterba <dsterba@suse.com>	2020-01-08 17:56:17 +01:00
Josef Bacik	423a716cd7	btrfs: do not delete mismatched root refs btrfs_del_root_ref() will simply WARN_ON() if the ref doesn't match in any way, and then continue to delete the reference. This shouldn't happen, we have these values because there's more to the reference than the original root and the sub root. If any of these checks fail, return -ENOENT. CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-01-08 14:44:24 +01:00
Josef Bacik	d49d3287e7	btrfs: fix invalid removal of root ref If we have the following sequence of events btrfs sub create A btrfs sub create A/B btrfs sub snap A C mkdir C/foo mv A/B C/foo rm -rf * We will end up with a transaction abort. The reason for this is because we create a root ref for B pointing to A. When we create a snapshot of C we still have B in our tree, but because the root ref points to A and not C we will make it appear to be empty. The problem happens when we move B into C. This removes the root ref for B pointing to A and adds a ref of B pointing to C. When we rmdir C we'll see that we have a ref to our root and remove the root ref, despite not actually matching our reference name. Now btrfs_del_root_ref() allowing this to work is a bug as well, however we know that this inode does not actually point to a root ref in the first place, so we shouldn't be calling btrfs_del_root_ref() in the first place and instead simply look up our dir index for this item and do the rest of the removal. CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-01-08 14:44:23 +01:00
Josef Bacik	045d3967b6	btrfs: rework arguments of btrfs_unlink_subvol btrfs_unlink_subvol takes the name of the dentry and the root objectid based on what kind of inode this is, either a real subvolume link or a empty one that we inherited as a snapshot. We need to fix how we unlink in the case for BTRFS_EMPTY_SUBVOL_DIR_OBJECTID in the future, so rework btrfs_unlink_subvol to just take the dentry and handle getting the right objectid given the type of inode this is. There is no functional change here, simply pushing the work into btrfs_unlink_subvol() proper. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-01-08 14:43:34 +01:00
Jens Axboe	eacc6dfaea	io_uring: remove punt of short reads to async context We currently punt any short read on a regular file to async context, but this fails if the short read is due to running into EOF. This is especially problematic since we only do the single prep for commands now, as we don't reset kiocb->ki_pos. This can result in a 4k read on a 1k file returning zero, as we detect the short read and then retry from async context. At the time of retry, the position is now 1k, and we end up reading nothing, and hence return 0. Instead of trying to patch around the fact that short reads can be legitimate and won't succeed in case of retry, remove the logic to punt a short read to async context. Simply return it. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-07 13:08:56 -07:00
Vladimir Zapolskiy	e3915ad94b	erofs: remove void tagging/untagging of workgroup pointers Because workgroup pointers inserted to a radix tree are always tagged with a single value of 0, it is possible to remove tagging and untagging of the pointers completely. Signed-off-by: Vladimir Zapolskiy <vladimir@tuxera.com> Link: https://lore.kernel.org/r/20200102120118.14979-4-vladimir@tuxera.com Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Gao Xiang <gaoxiang25@huawei.com>	2020-01-07 10:09:21 +08:00
Vladimir Zapolskiy	e5e9a43203	erofs: remove unused tag argument while registering a workgroup All workgroups are registered with tag value set to 0, to simplify erofs_register_workgroup() interface the tag argument can be removed, if its only value is sent down to the function body. Signed-off-by: Vladimir Zapolskiy <vladimir@tuxera.com> Link: https://lore.kernel.org/r/20200102120118.14979-3-vladimir@tuxera.com Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Gao Xiang <gaoxiang25@huawei.com>	2020-01-07 10:08:45 +08:00
Vladimir Zapolskiy	997626d838	erofs: remove unused tag argument while finding a workgroup It is feasible to simplify erofs_find_workgroup() interface by removing an unused function argument. While formally the argument is used in the function itself, its assigned value is ignored on the caller side. Signed-off-by: Vladimir Zapolskiy <vladimir@tuxera.com> Link: https://lore.kernel.org/r/20200102120118.14979-2-vladimir@tuxera.com Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Gao Xiang <gaoxiang25@huawei.com>	2020-01-07 10:08:38 +08:00
Vladimir Zapolskiy	a55861c800	erofs: correct indentation of an assigned structure inside a function Trivial change, the expected indentation ruled by the coding style hasn't been met. Signed-off-by: Vladimir Zapolskiy <vladimir@tuxera.com> Link: https://lore.kernel.org/r/20200102120232.15074-1-vladimir@tuxera.com Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Gao Xiang <gaoxiang25@huawei.com>	2020-01-07 10:08:27 +08:00
Daniel W. S. Almeida	adc92dd455	debugfs: Fix warnings when building documentation Fix the following warnings: fs/debugfs/inode.c:423: WARNING: Inline literal start-string without end-string. fs/debugfs/inode.c:502: WARNING: Inline literal start-string without end-string. fs/debugfs/inode.c:534: WARNING: Inline literal start-string without end-string. fs/debugfs/inode.c:627: WARNING: Inline literal start-string without end-string. fs/debugfs/file.c:496: WARNING: Inline literal start-string without end-string. fs/debugfs/file.c:502: WARNING: Inline literal start-string without end-string. fs/debugfs/file.c:581: WARNING: Inline literal start-string without end-string. fs/debugfs/file.c:587: WARNING: Inline literal start-string without end-string. fs/debugfs/file.c:846: WARNING: Inline literal start-string without end-string. fs/debugfs/file.c:852: WARNING: Inline literal start-string without end-string. fs/debugfs/file.c:899: WARNING: Inline literal start-string without end-string. fs/debugfs/file.c:905: WARNING: Inline literal start-string without end-string. fs/debugfs/file.c:1091: WARNING: Inline literal start-string without end-string. fs/debugfs/file.c:1097: WARNING: Inline literal start-string without end-string By replacing %ERR_PTR with ERR_PTR. Signed-off-by: Daniel W. S. Almeida <dwlsalmeida@gmail.com> Link: https://lore.kernel.org/r/20191227010035.854913-1-dwlsalmeida@gmail.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2020-01-06 20:35:37 +01:00
Will Deacon	68faa679b8	chardev: Avoid potential use-after-free in 'chrdev_open()' 'chrdev_open()' calls 'cdev_get()' to obtain a reference to the 'struct cdev ' stashed in the 'i_cdev' field of the target inode structure. If the pointer is NULL, then it is initialised lazily by looking up the kobject in the 'cdev_map' and so the whole procedure is protected by the 'cdev_lock' spinlock to serialise initialisation of the shared pointer. Unfortunately, it is possible for the initialising thread to fail after* installing the new pointer, for example if the subsequent '->open()' call on the file fails. In this case, 'cdev_put()' is called, the reference count on the kobject is dropped and, if nobody else has taken a reference, the release function is called which finally clears 'inode->i_cdev' from 'cdev_purge()' before potentially freeing the object. The problem here is that a racing thread can happily take the 'cdev_lock' and see the non-NULL pointer in the inode, which can result in a refcount increment from zero and a warning: \| ------------[ cut here ]------------ \| refcount_t: addition on 0; use-after-free. \| WARNING: CPU: 2 PID: 6385 at lib/refcount.c:25 refcount_warn_saturate+0x6d/0xf0 \| Modules linked in: \| CPU: 2 PID: 6385 Comm: repro Not tainted 5.5.0-rc2+ #22 \| Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014 \| RIP: 0010:refcount_warn_saturate+0x6d/0xf0 \| Code: 05 55 9a 15 01 01 e8 9d aa c8 ff 0f 0b c3 80 3d 45 9a 15 01 00 75 ce 48 c7 c7 00 9c 62 b3 c6 08 \| RSP: 0018:ffffb524c1b9bc70 EFLAGS: 00010282 \| RAX: 0000000000000000 RBX: ffff9e9da1f71390 RCX: 0000000000000000 \| RDX: ffff9e9dbbd27618 RSI: ffff9e9dbbd18798 RDI: ffff9e9dbbd18798 \| RBP: 0000000000000000 R08: 000000000000095f R09: 0000000000000039 \| R10: 0000000000000000 R11: ffffb524c1b9bb20 R12: ffff9e9da1e8c700 \| R13: ffffffffb25ee8b0 R14: 0000000000000000 R15: ffff9e9da1e8c700 \| FS: 00007f3b87d26700(0000) GS:ffff9e9dbbd00000(0000) knlGS:0000000000000000 \| CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 \| CR2: 00007fc16909c000 CR3: 000000012df9c000 CR4: 00000000000006e0 \| DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 \| DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 \| Call Trace: \| kobject_get+0x5c/0x60 \| cdev_get+0x2b/0x60 \| chrdev_open+0x55/0x220 \| ? cdev_put.part.3+0x20/0x20 \| do_dentry_open+0x13a/0x390 \| path_openat+0x2c8/0x1470 \| do_filp_open+0x93/0x100 \| ? selinux_file_ioctl+0x17f/0x220 \| do_sys_open+0x186/0x220 \| do_syscall_64+0x48/0x150 \| entry_SYSCALL_64_after_hwframe+0x44/0xa9 \| RIP: 0033:0x7f3b87efcd0e \| Code: 89 54 24 08 e8 a3 f4 ff ff 8b 74 24 0c 48 8b 3c 24 41 89 c0 44 8b 54 24 08 b8 01 01 00 00 89 f4 \| RSP: 002b:00007f3b87d259f0 EFLAGS: 00000293 ORIG_RAX: 0000000000000101 \| RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f3b87efcd0e \| RDX: 0000000000000000 RSI: 00007f3b87d25a80 RDI: 00000000ffffff9c \| RBP: 00007f3b87d25e90 R08: 0000000000000000 R09: 0000000000000000 \| R10: 0000000000000000 R11: 0000000000000293 R12: 00007ffe188f504e \| R13: 00007ffe188f504f R14: 00007f3b87d26700 R15: 0000000000000000 \| ---[ end trace 24f53ca58db8180a ]--- Since 'cdev_get()' can already fail to obtain a reference, simply move it over to use 'kobject_get_unless_zero()' instead of 'kobject_get()', which will cause the racing thread to return -ENXIO if the initialising thread fails unexpectedly. Cc: Hillf Danton <hdanton@sina.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Reported-by: syzbot+82defefbbd8527e1c2cb@syzkaller.appspotmail.com Signed-off-by: Will Deacon <will@kernel.org> Cc: stable <stable@vger.kernel.org> Link: https://lore.kernel.org/r/20191219120203.32691-1-will@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2020-01-06 20:10:26 +01:00
Gang He	b73eba2a86	ocfs2: fix the crash due to call ocfs2_get_dlm_debug once less Because ocfs2_get_dlm_debug() function is called once less here, ocfs2 file system will trigger the system crash, usually after ocfs2 file system is unmounted. This system crash is caused by a generic memory corruption, these crash backtraces are not always the same, for exapmle, ocfs2: Unmounting device (253,16) on (node 172167785) general protection fault: 0000 [#1] SMP PTI CPU: 3 PID: 14107 Comm: fence_legacy Kdump: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996) RIP: 0010:__kmalloc+0xa5/0x2a0 Code: 00 00 4d 8b 07 65 4d 8b RSP: 0018:ffffaa1fc094bbe8 EFLAGS: 00010286 RAX: 0000000000000000 RBX: d310a8800d7a3faf RCX: 0000000000000000 RDX: 0000000000000000 RSI: 0000000000000dc0 RDI: ffff96e68fc036c0 RBP: d310a8800d7a3faf R08: ffff96e6ffdb10a0 R09: 00000000752e7079 R10: 000000000001c513 R11: 0000000004091041 R12: 0000000000000dc0 R13: 0000000000000039 R14: ffff96e68fc036c0 R15: ffff96e68fc036c0 FS: 00007f699dfba540(0000) GS:ffff96e6ffd80000(0000) knlGS:00000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000055f3a9d9b768 CR3: 000000002cd1c000 CR4: 00000000000006e0 Call Trace: ext4_htree_store_dirent+0x35/0x100 [ext4] htree_dirblock_to_tree+0xea/0x290 [ext4] ext4_htree_fill_tree+0x1c1/0x2d0 [ext4] ext4_readdir+0x67c/0x9d0 [ext4] iterate_dir+0x8d/0x1a0 __x64_sys_getdents+0xab/0x130 do_syscall_64+0x60/0x1f0 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x7f699d33a9fb This regression problem was introduced by commit `e581595ea2` ("ocfs: no need to check return value of debugfs_create functions"). Link: http://lkml.kernel.org/r/20191225061501.13587-1-ghe@suse.com Fixes: `e581595ea2` ("ocfs: no need to check return value of debugfs_create functions") Signed-off-by: Gang He <ghe@suse.com> Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Jun Piao <piaojun@huawei.com> Cc: <stable@vger.kernel.org> [5.3+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2020-01-04 13:55:09 -08:00
Kai Li	397eac17f8	ocfs2: call journal flush to mark journal as empty after journal recovery when mount If journal is dirty when mount, it will be replayed but jbd2 sb log tail cannot be updated to mark a new start because journal->j_flag has already been set with JBD2_ABORT first in journal_init_common. When a new transaction is committed, it will be recored in block 1 first(journal->j_tail is set to 1 in journal_reset). If emergency restart happens again before journal super block is updated unfortunately, the new recorded trans will not be replayed in the next mount. The following steps describe this procedure in detail. 1. mount and touch some files 2. these transactions are committed to journal area but not checkpointed 3. emergency restart 4. mount again and its journals are replayed 5. journal super block's first s_start is 1, but its s_seq is not updated 6. touch a new file and its trans is committed but not checkpointed 7. emergency restart again 8. mount and journal is dirty, but trans committed in 6 will not be replayed. This exception happens easily when this lun is used by only one node. If it is used by multi-nodes, other node will replay its journal and its journal super block will be updated after recovery like what this patch does. ocfs2_recover_node->ocfs2_replay_journal. The following jbd2 journal can be generated by touching a new file after journal is replayed, and seq 15 is the first valid commit, but first seq is 13 in journal super block. logdump: Block 0: Journal Superblock Seq: 0 Type: 4 (JBD2_SUPERBLOCK_V2) Blocksize: 4096 Total Blocks: 32768 First Block: 1 First Commit ID: 13 Start Log Blknum: 1 Error: 0 Feature Compat: 0 Feature Incompat: 2 block64 Feature RO compat: 0 Journal UUID: 4ED3822C54294467A4F8E87D2BA4BC36 FS Share Cnt: 1 Dynamic Superblk Blknum: 0 Per Txn Block Limit Journal: 0 Data: 0 Block 1: Journal Commit Block Seq: 14 Type: 2 (JBD2_COMMIT_BLOCK) Block 2: Journal Descriptor Seq: 15 Type: 1 (JBD2_DESCRIPTOR_BLOCK) No. Blocknum Flags 0. 587 none UUID: 00000000000000000000000000000000 1. 8257792 JBD2_FLAG_SAME_UUID 2. 619 JBD2_FLAG_SAME_UUID 3. 24772864 JBD2_FLAG_SAME_UUID 4. 8257802 JBD2_FLAG_SAME_UUID 5. 513 JBD2_FLAG_SAME_UUID JBD2_FLAG_LAST_TAG ... Block 7: Inode Inode: 8257802 Mode: 0640 Generation: 57157641 (0x3682809) FS Generation: 2839773110 (0xa9437fb6) CRC32: 00000000 ECC: 0000 Type: Regular Attr: 0x0 Flags: Valid Dynamic Features: (0x1) InlineData User: 0 (root) Group: 0 (root) Size: 7 Links: 1 Clusters: 0 ctime: 0x5de5d870 0x11104c61 -- Tue Dec 3 11:37:20.286280801 2019 atime: 0x5de5d870 0x113181a1 -- Tue Dec 3 11:37:20.288457121 2019 mtime: 0x5de5d870 0x11104c61 -- Tue Dec 3 11:37:20.286280801 2019 dtime: 0x0 -- Thu Jan 1 08:00:00 1970 ... Block 9: Journal Commit Block Seq: 15 Type: 2 (JBD2_COMMIT_BLOCK) The following is journal recovery log when recovering the upper jbd2 journal when mount again. syslog: ocfs2: File system on device (252,1) was not unmounted cleanly, recovering it. fs/jbd2/recovery.c:(do_one_pass, 449): Starting recovery pass 0 fs/jbd2/recovery.c:(do_one_pass, 449): Starting recovery pass 1 fs/jbd2/recovery.c:(do_one_pass, 449): Starting recovery pass 2 fs/jbd2/recovery.c:(jbd2_journal_recover, 278): JBD2: recovery, exit status 0, recovered transactions 13 to 13 Due to first commit seq 13 recorded in journal super is not consistent with the value recorded in block 1(seq is 14), journal recovery will be terminated before seq 15 even though it is an unbroken commit, inode 8257802 is a new file and it will be lost. Link: http://lkml.kernel.org/r/20191217020140.2197-1-li.kai4@h3c.com Signed-off-by: Kai Li <li.kai4@h3c.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: Changwei Ge <gechangwei@live.cn> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Gang He <ghe@suse.com> Cc: Jun Piao <piaojun@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2020-01-04 13:55:09 -08:00
Randy Dunlap	e39e773ad1	fs/posix_acl.c: fix kernel-doc warnings Fix kernel-doc warnings in fs/posix_acl.c. Also fix one typo (setgit -> setgid). fs/posix_acl.c:647: warning: Function parameter or member 'inode' not described in 'posix_acl_update_mode' fs/posix_acl.c:647: warning: Function parameter or member 'mode_p' not described in 'posix_acl_update_mode' fs/posix_acl.c:647: warning: Function parameter or member 'acl' not described in 'posix_acl_update_mode' Link: http://lkml.kernel.org/r/29b0dc46-1f28-a4e5-b1d0-ba2b65629779@infradead.org Fixes: `073931017b` ("posix_acl: Clear SGID bit when setting file permissions") Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Acked-by: Andreas Gruenbacher <agruenba@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Jan Kara <jack@suse.cz> Cc: Andreas Gruenbacher <agruenba@redhat.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2020-01-04 13:55:09 -08:00
Eric Biggers	213921f967	fs/namespace.c: make to_mnt_ns() static Make to_mnt_ns() static to address the following 'sparse' warning: fs/namespace.c:1731:22: warning: symbol 'to_mnt_ns' was not declared. Should it be static? Link: http://lkml.kernel.org/r/20191209234830.156260-1-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2020-01-04 13:55:09 -08:00
Eric Biggers	7bebd69ecf	fs/nsfs.c: include headers for missing declarations Include linux/proc_fs.h and fs/internal.h to address the following 'sparse' warnings: fs/nsfs.c:41:32: warning: symbol 'ns_dentry_operations' was not declared. Should it be static? fs/nsfs.c:145:5: warning: symbol 'open_related_ns' was not declared. Should it be static? Link: http://lkml.kernel.org/r/20191209234822.156179-1-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2020-01-04 13:55:09 -08:00
Eric Biggers	b16155a0b0	fs/direct-io.c: include fs/internal.h for missing prototype Include fs/internal.h to address the following 'sparse' warning: fs/direct-io.c:591:5: warning: symbol 'sb_init_dio_done_wq' was not declared. Should it be static? Link: http://lkml.kernel.org/r/20191209234544.128302-1-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2020-01-04 13:55:09 -08:00
Linus Torvalds	3a562aee72	for-5.5-rc4-tag -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl4PenMACgkQxWXV+ddt WDtsyg/6A86Da1zpLQKHiQtfhH629oeL/lFI0Cz+52xcgKSAUX8pHrWzKawSVpTT OTgksskHREPDjyDQytxAUKJnuplFEF7sCFNhzhsjQ7cvhC0lLDmW0VQr1Wob1/c4 miW18YXN2DrYFQqdCeypvq1FItglxuBmsa9eITYzytRZcATY+3b506FhT/d8NIiH f3nnChUiNuyfAiez1SC4FavDpVma1O6SpPMk29wUN/+/Dnd4aBrfvhuygy7xyIOK rzotRR5xagQDpei+99lT2hys4Pv0yEOajoGmgbgpaZNP0vgQcBLaF5UTeMv/Ib2j i+muu1rWi4R6lS3hh30kRSifKmPF7I9JB1dbd5gPfZiERlDQk+qZWrPFv9l0lf7M R3jn04EaPVNr0dtftRFF2VvpIlUQYgIvwZHx4Td6Oy9XO0X4n5vlKxYg9aHmybJ2 Vni13ad2oWNLo2Dd1eUYrEJ/QwGc1pq7JU9HiOST7yEJv1FN++AF4qTYTyvc7Yl4 HrN349/2k2J9vU88BIlXIxYG73AL9lIpGF2ROiEw+xn4oOevDP/5HCP8ZvELoNVM NqhJGoURu+AhZ7Rm3RYAYySShjQPF+qw/Dnz8sowQsEFUQ990E1wUxWZwTOP6+JQ cdwzYiXYIExTghfQdUDdwNlpCXUbSuJxFOGjKmhBuzA7HHwGf1Q= =jGvW -----END PGP SIGNATURE----- Merge tag 'for-5.5-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "A few fixes for btrfs: - blkcg accounting problem with compression that could stall writes - setting up blkcg bio for compression crashes due to NULL bdev pointer - fix possible infinite loop in writeback for nocow files (here possible means almost impossible, 13 things that need to happen to trigger it)" * tag 'for-5.5-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: Btrfs: fix infinite loop during nocow writeback due to race btrfs: fix compressed write bio blkcg attribution btrfs: punt all bios created in btrfs_submit_compressed_write()	2020-01-03 12:20:21 -08:00
Linus Torvalds	b6b4aafc99	block-5.5-20200103 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl4PauoQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpkqMD/4zcRGP8LhkVG3I8QMqmhhJNyZb5WP5O/mV ekWYYzxaPhn/WQjM17Y6MBuvPTqUk1R/FiZdjW+OPUOrXUPsasSSKmLvcpiHIRHY YTMhzNNRpainakM7q+BwhfeNm/T3d0JXMfN8oufIfrdFHy7+bcZupctv0aQgnLdz Z8f40nR8t/uYfRrMzCjEM9mY2P9qelwP3ylnloS2XC0r+O/j/tBZEtnormhFwnfa NGx732WyHBK4Qt9bRK8FzmuUmUmhxU+MWHcR5erCCzeFEY4DnzzthFuKGvZ3Ai0f qWMjYJBrWHvNJL6M4Dm3zE/tVWIFg3//zo7buuZD29Ms3GoqGj88mbrS/BFKNp1+ U5cShEMqd/7y5RUB0nOzRnfGVVHz/2WRRKvhc0/cFHulJb0q28u5pHhmlrduvxr/ VogoSiimo7qLVGiBeXgfubuyB3nvzGvDICgiqt3rIVgpgMytoiBz53HgnNLB8QtO CtYDJOm8sCTtnkEPshij9Ly0dfIYUVKz7QIx65qSfcv+vxag/WmWuG5woRtqtB6N AoFqf8PpHXHtVrokbqAG4j4R7QrLSrYSTh/vDa6muTaiu7fe5Gb2gXKWPbIGU5n+ b+2oKqlpVN1ZqOvhUv/PXW0U/nMx2j5Untank1ebHd+416nfNqM5lN6jsfZ3HnaR tpST+JtLDg== =ll2G -----END PGP SIGNATURE----- Merge tag 'block-5.5-20200103' of git://git.kernel.dk/linux-block Pull block fixes from Jens Axboe: "Three fixes in here: - Fix for a missing split on default memory boundary mask (4G) (Ming) - Fix for multi-page read bio truncate (Ming) - Fix for null_blk zone close request handling (Damien)" * tag 'block-5.5-20200103' of git://git.kernel.dk/linux-block: null_blk: Fix REQ_OP_ZONE_CLOSE handling block: fix splitting segments on boundary masks block: add bio_truncate to fix guard_bio_eod	2020-01-03 12:11:30 -08:00
Jan Stancek	15f0ec941f	mm/hugetlbfs: fix for_each_hstate() loop in init_hugetlbfs_fs() LTP memfd_create04 started failing for some huge page sizes after v5.4-10135-gc3bfc5dd73c6. The problem is the check introduced to for_each_hstate() loop that should skip default_hstate_idx. Since it doesn't update 'i' counter, all subsequent huge page sizes are skipped as well. Fixes: `8fc312b32b` ("mm/hugetlbfs: fix error handling when setting up mounts") Signed-off-by: Jan Stancek <jstancek@redhat.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2020-01-03 10:39:08 -08:00
Arnd Bergmann	77b9040195	compat_ioctl: simplify the implementation Now that both native and compat ioctl syscalls are in the same file, a couple of simplifications can be made, bringing the implementation closer together: - do_vfs_ioctl(), ioctl_preallocate(), and compat_ioctl_preallocate() can become static, allowing the compiler to optimize better - slightly update the coding style for consistency between the functions. - rather than listing each command in two switch statements for the compat case, just call a single function that has all the common commands. As a side-effect, FS_IOC_RESVSP/FS_IOC_RESVSP64 are now available to x86 compat tasks, along with FS_IOC_RESVSP_32/FS_IOC_RESVSP64_32. This is harmless for i386 emulation, and can be considered a bugfix for x32 emulation, which never supported these in the past. Reviewed-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Arnd Bergmann <arnd@arndb.de>	2020-01-03 09:42:52 +01:00
Arnd Bergmann	2af563d071	compat_ioctl: move sys_compat_ioctl() to ioctl.c The rest of the fs/compat_ioctl.c file is no longer useful now, so move the actual syscall as planned. Reviewed-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Arnd Bergmann <arnd@arndb.de>	2020-01-03 09:42:52 +01:00
Arnd Bergmann	d320a9551e	compat_ioctl: scsi: move ioctl handling into drivers Each driver calling scsi_ioctl() gets an equivalent compat_ioctl() handler that implements the same commands by calling scsi_compat_ioctl(). The scsi_cmd_ioctl() and scsi_cmd_blk_ioctl() functions are compatible at this point, so any driver that calls those can do so for both native and compat mode, with the argument passed through compat_ptr(). With this, we can remove the entries from fs/compat_ioctl.c. The new code is larger, but should be easier to maintain and keep updated with newly added commands. Reviewed-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Acked-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de>	2020-01-03 09:42:32 +01:00
Linus Torvalds	278b14eb92	pstore bug fixes - always reset circular buffer state when writing new dump (Aleksandr Yashkin) - fix rare error-path memory leak (Kees Cook) -----BEGIN PGP SIGNATURE----- Comment: Kees Cook <kees@outflux.net> iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAl4OWFsWHGtlZXNjb29r QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJg3FD/0Vp/Qf0zfjGlV657kNhbYMKYSM h/ItNhFovnRnxobY2CzjZyYoU9ZCsv2bEhplILlJxG84l+nb3nUubHGhUVcoQ6iE aDJAdz632qB+R8l/Ouk0xUp/UnVvoheA68RCaeY6R7G/vAHqceoGD72Ji5kFe1T8 QxPqVAJAIn+hdBHtZVhuueqINQ+3nCUUwE4Yu0veqEG5wcf+awGDjDrha2ZsMyfM Zl54co7l5Z5MwLLTxXO8RRFXlGtmMXnLQcKUwdPIxi+ZBZRan2pdj2zX5Sr0eGA2 HtZchn9Bc5bbwERRobbwWzQeqbwfoRHMOWFq7mUHyaho/5Pbv67A0lOib2cuXG+U WjpXV9nLiAbVqIH0wph4Vppx5acDI7o0JRS0oct/f7trZE58zXfHLJHLqPrblV8J 3pr989yGBEYI8YCf2dJXIKQlT0+s24Iyby0RmkCwQC/DkRWf9+a6BBYLHvGS96Y2 gZ4wKGJRXbDP3m4NuojYRv3DwF71jtSw5bR1kLAdjEygbtxG064mTcACH+DsScKK 6zBbOA2qBf1ReFdz4KVN9TAOnPIsTgeBG5ttmzGgkKZUUAe9+65AJH8jS22NSD2S mn4S2hXio5hV3bMUJtItEzMjYwMkdeyac8EubPlLH4DoNdZ01ER6L07x83m5cPjl wZ845V9VZ14i4bkq8w== =yG+c -----END PGP SIGNATURE----- Merge tag 'pstore-v5.5-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull pstore bug fixes from Kees Cook: - always reset circular buffer state when writing new dump (Aleksandr Yashkin) - fix rare error-path memory leak (Kees Cook) * tag 'pstore-v5.5-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: pstore/ram: Write new dumps to start of recycled zones pstore/ram: Fix error-path memory leak in persistent_ram_new() callers	2020-01-02 16:39:51 -08:00
Dominik Brodowski	74f1a29910	Revert "fs: remove ksys_dup()" This reverts commit `8243186f0c` ("fs: remove ksys_dup()") and the subsequent fix for it in commit `2d3145f8d2` ("early init: fix error handling when opening /dev/console"). Trying to use filp_open() and f_dupfd() instead of pseudo-syscalls caused more trouble than what is worth it: it requires accessing vfs internals and it turns out there were other bugs in it too. In particular, the file reference counting was wrong - because unlike the original "open+2dup" sequence it used "filp_open+3f_dupfd" and thus had an extra leaked file reference. That in turn then caused odd problems with Androidx86 long after boot becaue of how the extra reference to the console kept the session active even after all file descriptors had been closed. Reported-by: youling 257 <youling257@gmail.com> Cc: Arvind Sankar <nivedita@alum.mit.edu> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2020-01-02 16:15:33 -08:00
Aleksandr Yashkin	9e5f1c1980	pstore/ram: Write new dumps to start of recycled zones The ram_core.c routines treat przs as circular buffers. When writing a new crash dump, the old buffer needs to be cleared so that the new dump doesn't end up in the wrong place (i.e. at the end). The solution to this problem is to reset the circular buffer state before writing a new Oops dump. Signed-off-by: Aleksandr Yashkin <a.yashkin@inango-systems.com> Signed-off-by: Nikolay Merinov <n.merinov@inango-systems.com> Signed-off-by: Ariel Gilman <a.gilman@inango-systems.com> Link: https://lore.kernel.org/r/20191223133816.28155-1-n.merinov@inango-systems.com Fixes: `896fc1f0c4` ("pstore/ram: Switch to persistent_ram routines") Cc: stable@vger.kernel.org Signed-off-by: Kees Cook <keescook@chromium.org>	2020-01-02 12:30:50 -08:00

... 2 3 4 5 6 ...

62300 Commits