If a new hugetlb page is allocated during fallocate it will not be
marked as active (set_page_huge_active) which will result in a later
isolate_huge_page failure when the page migration code would like to
move that page. Such a failure would be unexpected and wrong.
Only export set_page_huge_active, just leave clear_page_huge_active as
static. Because there are no external users.
Link: https://lkml.kernel.org/r/20210115124942.46403-3-songmuchun@bytedance.com
Fixes: 70c3547e36 (hugetlbfs: add hugetlbfs_fallocate())
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: David Hildenbrand <david@redhat.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCYBuyTQAKCRDh3BK/laaZ
PBBhAPwLy3ksQLhY7in4I8aKrSyWRpaCSAeLQUitxnX3eQiQnAD/S1EEIapwradV
y4ou1PBRsGnhwNgArXODVCcTgqDJqw8=
=GjU4
-----END PGP SIGNATURE-----
Merge tag 'ovl-fixes-5.11-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs
Pull overlayfs fixes from Miklos Szeredi:
- Fix capability conversion and minor overlayfs bugs that are related
to the unprivileged overlay mounts introduced in this cycle.
- Fix two recent (v5.10) and one old (v4.10) bug.
- Clean up security xattr copy-up (related to a SELinux regression).
* tag 'ovl-fixes-5.11-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
ovl: implement volatile-specific fsync error behaviour
ovl: skip getxattr of security labels
ovl: fix dentry leak in ovl_get_redirect
ovl: avoid deadlock on directory ioctl
cap: fix conversions on getxattr
ovl: perform vfs_getxattr() with mounter creds
ovl: add warning on user_ns mismatch
trees.
Current release - regressions:
- ip_tunnel: fix mtu calculation
- mlx5: fix function calculation for page trees
Previous releases - regressions:
- vsock: fix the race conditions in multi-transport support
- neighbour: prevent a dead entry from updating gc_list
- dsa: mv88e6xxx: override existent unicast portvec in port_fdb_add
Previous releases - always broken:
- bpf, cgroup: two copy_{from,to}_user() warn_on_once splats for BPF
cgroup getsockopt infra when user space is trying
to race against optlen, from Loris Reiff.
- bpf: add missing fput() in BPF inode storage map update helper
- udp: ipv4: manipulate network header of NATed UDP GRO fraglist
- mac80211: fix station rate table updates on assoc
- r8169: work around RTL8125 UDP HW bug
- igc: report speed and duplex as unknown when device is runtime
suspended
- rxrpc: fix deadlock around release of dst cached on udp tunnel
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmAZjwQACgkQMUZtbf5S
IruLbQ//Yg9+xEnqhDuOJZtYHB0rsJjLlKmtvgOsBr8BaTcUEPoPoqUPm+EMvCHb
o1fFa1qIrbS5luVEofu9hNX7DGXwvgawaMW2TympJhqLZQqjazCMB/st99LphhJw
RvaZI8aDOikosT4c+I0vm83jDQETonrjziIcPfHHPjn/Q+amGRRRXiTSQnRF/MlU
oARCG+U3kHsHBDUPNSCtSjKXshoZPjFb/pD7fQAlzzm7CssvbPhNWbducueyP2Fb
XW4RwJu9QBBH2JS6uZJ1Y6LVoRzusmE9dUam3KhkiL/CHs72lWPsc+Rn5gbBPvc5
Y4T4h61Xti1O4ULKdqhGceror6XY+4Qb1VlHWWztOhIo00wIAv3IHbTup/4o0HBr
j84MtcyOl/qxSFXjunPJkbWJngXikrkIMS0Bl6ZcPAejYM9wN6vCgbvFCHbEg1Rx
cWFnYyS9FCLduaxHSizv050tWhknOdX+zHK3fOtlW0yWnreJAB8Hoc21Zm7YKvg0
GxxcGK6AhqJ6s2ixVDv7MyJrltJ/hOJQb+T3HgHFuY2BYUs8F2r/HoHU/u4uCl76
RdBzbC/sLnBpMHf6r1rHTnGPsapoJOOYWnej71l425vX1qr5xnmxVNNB6HReObNv
+/jPoRYa5BVsVt2LmDcuH1O32pXJPWKVBR7Yfa6Bn2yzhcbECTc=
=ZByM
-----END PGP SIGNATURE-----
Merge tag 'net-5.11-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Networking fixes for 5.11-rc7, including fixes from bpf and mac80211
trees.
Current release - regressions:
- ip_tunnel: fix mtu calculation
- mlx5: fix function calculation for page trees
Previous releases - regressions:
- vsock: fix the race conditions in multi-transport support
- neighbour: prevent a dead entry from updating gc_list
- dsa: mv88e6xxx: override existent unicast portvec in port_fdb_add
Previous releases - always broken:
- bpf, cgroup: two copy_{from,to}_user() warn_on_once splats for BPF
cgroup getsockopt infra when user space is trying to race against
optlen, from Loris Reiff.
- bpf: add missing fput() in BPF inode storage map update helper
- udp: ipv4: manipulate network header of NATed UDP GRO fraglist
- mac80211: fix station rate table updates on assoc
- r8169: work around RTL8125 UDP HW bug
- igc: report speed and duplex as unknown when device is runtime
suspended
- rxrpc: fix deadlock around release of dst cached on udp tunnel"
* tag 'net-5.11-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (36 commits)
net: hsr: align sup_multicast_addr in struct hsr_priv to u16 boundary
net: ipa: fix two format specifier errors
net: ipa: use the right accessor in ipa_endpoint_status_skip()
net: ipa: be explicit about endianness
net: ipa: add a missing __iomem attribute
net: ipa: pass correct dma_handle to dma_free_coherent()
r8169: fix WoL on shutdown if CONFIG_DEBUG_SHIRQ is set
net/rds: restrict iovecs length for RDS_CMSG_RDMA_ARGS
net: mvpp2: TCAM entry enable should be written after SRAM data
net: lapb: Copy the skb before sending a packet
net/mlx5e: Release skb in case of failure in tc update skb
net/mlx5e: Update max_opened_tc also when channels are closed
net/mlx5: Fix leak upon failure of rule creation
net/mlx5: Fix function calculation for page trees
docs: networking: swap words in icmp_errors_use_inbound_ifaddr doc
udp: ipv4: manipulate network header of NATed UDP GRO fraglist
net: ip_tunnel: fix mtu calculation
vsock: fix the race conditions in multi-transport support
net: sched: replaced invalid qdisc tree flush helper in qdisc_replace
ibmvnic: device remove has higher precedence over reset
...
Highlights include:
Bugfixes:
- SUNRPC: Handle 0 length opaque XDR object data properly
- Fix a layout segment leak in pnfs_layout_process()
- pNFS/NFSv4: Update the layout barrier when we schedule a layoutreturn
- pNFS/NFSv4: Improve rejection of out-of-order layouts
- pNFS/NFSv4: Try to return invalid layout in pnfs_layout_process()
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEESQctxSBg8JpV8KqEZwvnipYKAPIFAmAW4QgACgkQZwvnipYK
APJNZw/6AnLawj0kjn7z0Wc2LA0QWxbAVGYGe28gQdy6qiBbuOiFDeH8itKk6m1c
R6ZPpFHFKYk6+CsNcNws2sz9gBQj7wzDIy3sHenIaiNgY/fWNKDC8woKkJFSUSMl
GSQ9rkCYwRJu1JxP7r/9gnw/86oUTy/PgMaGdz6CMZJlq9iNa8t2UqMOfmcN8EZ3
AIewe4fSV5ebfycVz6btdJy8OCwyUfQ1OMilfh+0+5HYlk/xUxr57+AHi9r8w6bq
3tzIq3imQRgZsPPo/DJo/D4hfeFYX849/Tp+I5ydREWIwREBz2PO8bHNFnDoeoLo
AJ8mkawvpx+jsHFaAHql6STvY7uTY7qqBqsX2qSCqd6n2VEU0+cnDCY1IcgjcfBR
ozaYHJQm9ZhHzska3r/aKBQmkth9LIPU6aIMcYtjzC3ywua2vfCBSPRYKES80kIV
Pzgf5yRZFTEp7jGV9Uhf3Hucm3oIF9WVonDpSPbThdHUUXAYAVK1HZwgWx72HskL
BEhdaD+zsacv58C1+BE3vlh6A/j/cZAQifTfflgkLE3JE1IiKJwFjH4q6jgLwccx
kWLopK9Ds+ta+kLtlCuNTsPt7aGUoZZleH1Ghzdkw5Dfv2eEnR3YM6raa294avw4
DzKE/Rzgv5JuoSJhkWW/PiBZHcxMsv3SK7LTjO2oteFz88olsgo=
=gLzv
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-5.11-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client fixes from Trond Myklebust:
- SUNRPC: Handle 0 length opaque XDR object data properly
- Fix a layout segment leak in pnfs_layout_process()
- pNFS/NFSv4: Update the layout barrier when we schedule a layoutreturn
- pNFS/NFSv4: Improve rejection of out-of-order layouts
- pNFS/NFSv4: Try to return invalid layout in pnfs_layout_process()
* tag 'nfs-for-5.11-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
SUNRPC: Handle 0 length opaque XDR object data properly
SUNRPC: Move simple_get_bytes and simple_get_netobj into private header
pNFS/NFSv4: Improve rejection of out-of-order layouts
pNFS/NFSv4: Update the layout barrier when we schedule a layoutreturn
pNFS/NFSv4: Try to return invalid layout in pnfs_layout_process()
pNFS/NFSv4: Fix a layout segment leak in pnfs_layout_process()
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmAU7zwACgkQiiy9cAdy
T1FTFwwAicf0cTWv554HNz+7JLFledB7uVK7XgIMjqNzfGiGuBV1PQg6CGo/+FEi
/3W0qW+JzN3lKtRGKEb6BUEH/Eklz9p+RQl3K5H0e5YM/eGjpVAIB+6HxDSqq5XA
Iot4QCo7mnxdt8Keg0/X1s+ySp7QsYjK0QEHWPKBN5KrdzBtnSo0elJSNzmNXBXE
2aLCRyrszQmqjNLhePiuGvINR8nM8wKhNDV5iHN+UhvAboF4vIOBP/0kS5UuTo/D
NMlTvzp65+rag9NmJ64n19/WLU8MRnKrLm0HgpCDyCYQ09bXToM4DhKSAcUJsLYY
06DMF2mrKA0ZubRsoD3U2aFoC1gRji4/Dsx2/zJq5Lrj70TYxSrqJNH/F6wqPf8o
92rzm/k34EnmJMPu4omhA6M7eE6DUzFTtUcvwFgqfD95CglAvmiJ0YnN9fzS9pSB
s4+ON+0h/Wj/VukBDadjdWmlUkLwQnzW7o2AlMlJg/MAot1bn6d1TWe8kmSoD3D6
ZR7U5JZh
=4/ty
-----END PGP SIGNATURE-----
Merge tag '5.11-rc5-smb3' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs fixes from Steve French:
"Four cifs patches found in additional testing of the conversion to the
new mount API: three small option processing ones, and one fixing domain
based DFS referrals"
* tag '5.11-rc5-smb3' of git://git.samba.org/sfrench/cifs-2.6:
cifs: fix dfs domain referrals
cifs: returning mount parm processing errors correctly
cifs: fix mounts to subdirectories of target
cifs: ignore auto and noauto options if given
AF_RXRPC sockets use UDP ports in encap mode. This causes socket and dst
from an incoming packet to get stolen and attached to the UDP socket from
whence it is leaked when that socket is closed.
When a network namespace is removed, the wait for dst records to be cleaned
up happens before the cleanup of the rxrpc and UDP socket, meaning that the
wait never finishes.
Fix this by moving the rxrpc (and, by dependence, the afs) private
per-network namespace registrations to the device group rather than subsys
group. This allows cached rxrpc local endpoints to be cleared and their
UDP sockets closed before we try waiting for the dst records.
The symptom is that lines looking like the following:
unregister_netdevice: waiting for lo to become free
get emitted at regular intervals after running something like the
referenced syzbot test.
Thanks to Vadim for tracking this down and work out the fix.
Reported-by: syzbot+df400f2f24a1677cd7e0@syzkaller.appspotmail.com
Reported-by: Vadim Fedorenko <vfedorenko@novek.ru>
Fixes: 5271953cad ("rxrpc: Use the UDP encap_rcv hook")
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Vadim Fedorenko <vfedorenko@novek.ru>
Link: https://lore.kernel.org/r/161196443016.3868642.5577440140646403533.stgit@warthog.procyon.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmAUIkAACgkQxWXV+ddt
WDsWVg/+IIEk9H1v9q9ShvVmPvmnlT8/0ywj1hdwFMBkFBjIeU8tBz9ZMGPXCzrF
XemmWKChVOnR3SIq/bMrwuRC/Gv/pBvwVshXLP51YJHv7lSGX0Ayrb27BFQcVaC/
3QhpE7veEiqxwLyMj+LWG4hE2X+oqiqzrXCpeC5un4zEluT45RSKooqueQ4jM8aw
DrKLQA57a1YEIqrE2KQzy5A6BnSNyxPXEEX34kbugmmen46Fh77hrwme1K9vQn1t
v3/V4LcarXADxxokAxU2Igb/vK0+BN33NOYsBwLWWD4kUaTGS4KczsDOowkRRTMH
/qiQUdca0X7ElR+VFl8rgB8PxuJcZ87aCdsMkErUA4sjxyp11VDIeEgirPNAcXtR
b+1LIkn3k3l8JzkKyXwDuZuNBsh0idTY24IE+QDBMIGq+jE1N6N3t5gEwa2NeaiP
9O5QnS5XAJCo8a9+gp1aF5z94vwQwvf9TA80nGrnpxGmXEEEZ9PgXsc4JON1Blhn
NtJDwBPzEjHCEYdE73/lRMsLmYeGhpRugKb+lQ+OTo2iZzxH2SjWn9vXKiN7vAp2
zysjzdPfkY5BLggH5cPg0fuRaf/Is00EeVqn3eA7QsFKDhrpoPFBO+aV5xeshsaz
8fjt7kkXFb+Vyy4SDvmPioJQ7/MFZ5Czn+BL1JwO4l/vYcEMUzM=
=/yHv
-----END PGP SIGNATURE-----
Merge tag 'for-5.11-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"A few more fixes for a late rc:
- fix lockdep complaint on 32bit arches and also remove an unsafe
memory use due to device vs filesystem lifetime
- two fixes for free space tree:
* race during log replay and cache rebuild, now more likely to
happen due to changes in this dev cycle
* possible free space tree corruption with online conversion
during initial tree population"
* tag 'for-5.11-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix log replay failure due to race with space cache rebuild
btrfs: fix lockdep warning due to seqcount_mutex on 32bit arch
btrfs: fix possible free space tree corruption with online conversion
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmAUXQsQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgppO4EAClcqoneAuhT4UvRVNxblXPhPaoC69aNgXd
s+34uQSCqeWrWIAokfKp8bh3kyRqe00591auA7DwtwNqGpWuIECX8o9QvROEkuxv
0o4JFGMTHOJKP1W79Oy3RpF5oee6rMMOQN7EFL272p2xd8NRCP33c4fKvJRz+DDE
0kCcZhVjca0nZ+9OJC+WAlV+dit3azCAKSp7cItJsdOgZL74ZcGECm0pA8RpStyi
tQrUr2yiHLkm1lcOYfid0fG2/5a4vAGZQav+EshOWYw9UGeMquq/aqPuZZtEUjKe
oEECACfJ9cWErsi1CirIk5j5RKHOHmFSG3kRAmyvFB4f3YDGYxerI7eodWjNA0d5
38wW96sWuV4l0ShPmD3jGWIDTTcDZh4nEImCObf5YJFbr2fQXofWVWseIyo0zG8Y
zDa1N/M7XgkrScX8OF33NC1uv/oExhHA7jXuQN6mRBESYjcCrH2Lf6mXAA2C8u4T
z1RaG7ckRXGSbV3ol1ROrHj0RTXQ3zeIHj3yMRU8TKH0z6s+ob46D2PZCLi6cLvI
IuELhzKsS1EzMSVsYk9/AegynWFjVCRJoVUVxTsrxfGEF7attwmur3lOAjbZwSWb
jXlRbrkgBL1Pwbjg8AODEoq0jJgVM/S/3fG2rpcYLwwYC+FQ73/K+URmEuMsqkFC
GrYllTSMFg==
=hb7W
-----END PGP SIGNATURE-----
Merge tag 'block-5.11-2021-01-29' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
"All over the place fixes for this release:
- blk-cgroup iteration teardown resched fix (Baolin)
- NVMe pull request from Christoph:
- add another Write Zeroes quirk (Chaitanya Kulkarni)
- handle a no path available corner case (Daniel Wagner)
- use the proper RCU aware list_add helper (Chao Leng)
- bcache regression fix (Coly)
- bdev->bd_size_lock IRQ fix. This will be fixed in drivers for 5.12,
but for now, we'll make it IRQ safe (Damien)
- null_blk zoned init fix (Damien)
- add_partition() error handling fix (Dinghao)
- s390 dasd kobject fix (Jan)
- nbd fix for freezing queue while adding connections (Josef)
- tag queueing regression fix (Ming)
- revert of a patch that inadvertently meant that we regressed write
performance on raid (Maxim)"
* tag 'block-5.11-2021-01-29' of git://git.kernel.dk/linux-block:
null_blk: cleanup zoned mode initialization
nvme-core: use list_add_tail_rcu instead of list_add_tail for nvme_init_ns_head
nvme-multipath: Early exit if no path is available
nvme-pci: add the DISABLE_WRITE_ZEROES quirk for a SPCC device
bcache: only check feature sets when sb->version >= BCACHE_SB_VERSION_CDEV_WITH_FEATURES
block: fix bd_size_lock use
blk-cgroup: Use cond_resched() when destroy blkgs
Revert "block: simplify set_init_blocksize" to regain lost performance
nbd: freeze the queue while we're adding connections
s390/dasd: Fix inconsistent kobject removal
block: Fix an error handling in add_partition
blk-mq: test QUEUE_FLAG_HCTX_ACTIVE for sbitmap_shared in hctx_may_queue
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmAUXJoQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpplXD/9v4iQNBN/TzLnFufOAoSX8Y6Gm/0ykr7k3
wPVNBMJ4g7twdI2FDFZn6GDfEpT7+aIjSyPOcGbUznvVFNYzLrTdGpzxOXZ91E6K
G0wpxhYgQxeiaCYpfa4JFw1bfPSWM/e9IZ7dqO2rpUj0yJC2+0mUDP2xpoTbyfeR
bP/qVMp7Ij0WRul4GWHUN/KURYnpY97/3uGcXqjyxYA06KstIMMfxWCqx0So3eGR
MCjrHtASey/I0XnhcJ0M7Wa2OJBHzrh9txP2YCHtI1u3mU13V65L0kw5i4FzFlKY
g7OpAXmUnWuoLtUe/aPX5/gtSbtYeRrkmF4PRv/7FtW+pE9mWo7LtJC6ymWW/ymG
5qa3oc3X1A25EMnMngLfOcgOHkMQW5NQzBMXGObuYSQoiwp3eJY8JdJCabqbM8kx
9oJlOKiZU/jEbzNvPGZmjSjGj7uzAL90fK9K3X7pCB/ZIynzQo5mVhaGoeWNW6Nq
b+G0qcL79Ct1tas0Dgan86388yiS56CUGJOIGyDTvlIlKSCXo3K7/e1SFnt43M6K
WRHp8MgL7crM7UZpKAyBZD4BeL3SHp3sJMYdd0EgrJiHCO2IODDAmsuF8n57ef/1
aSmKKo8/hjxxFZ7NsBF8N3y+1SfItKjr3sZgGW6hXM+kzNFM2WPcOBHGsoejon/e
sZlBSj8D+w==
=oCb5
-----END PGP SIGNATURE-----
Merge tag 'io_uring-5.11-2021-01-29' of git://git.kernel.dk/linux-block
Pull io_uring fixes from Jens Axboe:
"We got the cancelation story sorted now, so for all intents and
purposes, this should be it for 5.11 outside of any potential little
fixes that may come in. This contains:
- task_work task state fixes (Hao, Pavel)
- Cancelation fixes (me, Pavel)
- Fix for an inflight req patch in this release (Pavel)
- Fix for a lock deadlock issue (Pavel)"
* tag 'io_uring-5.11-2021-01-29' of git://git.kernel.dk/linux-block:
io_uring: reinforce cancel on flush during exit
io_uring: fix sqo ownership false positive warning
io_uring: fix list corruption for splice file_get
io_uring: fix flush cqring overflow list while TASK_INTERRUPTIBLE
io_uring: fix wqe->lock/completion_lock deadlock
io_uring: fix cancellation taking mutex while TASK_UNINTERRUPTIBLE
io_uring: fix __io_uring_files_cancel() with TASK_UNINTERRUPTIBLE
io_uring: only call io_cqring_ev_posted() if events were posted
io_uring: if we see flush on exit, cancel related tasks
The new mount API requires additional changes to how DFS
is handled. Additional testing of DFS uncovered problems
with domain based DFS referrals (a follow on patch addresses
DFS links) which this patch addresses.
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Signed-off-by: Steve French <stfrench@microsoft.com>
of UID translations to occur, in some configurations, when setting v3
namespaced file capabilities.
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEKvuQkp28KvPJn/fVfL7QslYSS/0FAmATYfMRHGNvZGVAdHlo
aWNrcy5jb20ACgkQfL7QslYSS/2Gsg//SoR/VpEn3FfsRs/niAsPAmOi0i9m0zxD
rMZ7M12GAQFo9B2AaBhttUejcLzBuIM6DIpRUYyP9QBoSONPvmNy2l58tD3xd97k
ZXkGqOwG65q20zRTWiS2x3u271SwmCiZJzC7xIYUH36ZLWySfYLuI0QD8HqYdfD1
iNdDiYCzkilRG0PuaPIFtfq4OL/NizeBdBwJR2F5PJQdixocVsmdJKO+lTw5A8PJ
EefMC8lgV5pg+nVHlERXr9bg5BXaxRhE4hqwDPD7qB91piA8j7CxNIdagmjw5d7p
KOLYO4Ek3wKJY1MGMJ/hBNXQIBxMJX7DBEFUi1y/+Eiw3QUr9XUkQiqe6nYLALPa
m0IOKQJOkcuZLd5cCACnfv6XTu8iAabpilwUIi6TnADwzByc0jaIjypIbkVulScH
YMGE+HO9X0jzMfpWMG/FopHVuGb5t6zdIukfb/Ndo6tIbEZx+cr+uZ3HEB86mMLw
diJVXDGtdRGZRz5seN6mVRuGUFL/Xlg/wEhq60hxxrBstW7yvMGkjII0AowH3Sri
pCXXq6W/t2MA1sJKfSp33v3vFeG98y77aDO0Djh3G7cg4XLCiAxQkQu3RO1Vm9wi
U5Q3Hd2Cmaxw/NG7tnG5R79wAEBoGW7LeRCAYsjcs6O1pmpn9krn9vSfXyhOAKz4
d1ukk31djsg=
=G4uE
-----END PGP SIGNATURE-----
Merge tag 'ecryptfs-5.11-rc6-setxattr-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/tyhicks/ecryptfs
Pull ecryptfs fix from Tyler Hicks:
"Fix a regression that resulted in two rounds of UID translations when
setting v3 namespaced file capabilities in some configurations"
* tag 'ecryptfs-5.11-rc6-setxattr-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/tyhicks/ecryptfs:
ecryptfs: fix uid translation for setxattr on security.capability
What 84965ff8a8 ("io_uring: if we see flush on exit, cancel related tasks")
really wants is to cancel all relevant REQ_F_INFLIGHT requests reliably.
That can be achieved by io_uring_cancel_files(), but we'll miss it
calling io_uring_cancel_task_requests(files=NULL) from io_uring_flush(),
because it will go through __io_uring_cancel_task_requests().
Just always call io_uring_cancel_files() during cancel, it's good enough
for now.
Cc: stable@vger.kernel.org # 5.9+
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
During additional testing of the updated cifs.ko with the
new mount API support, we found a few additional cases where
we were logging errors, but not returning them to the user.
For example:
a) invalid security mechanisms
b) invalid cache options
c) unsupported rdma
d) invalid smb dialect requested
Fixes: 24e0a1eff9 ("cifs: switch to new mount api")
Acked-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
WARNING: CPU: 0 PID: 21359 at fs/io_uring.c:9042
io_uring_cancel_task_requests+0xe55/0x10c0 fs/io_uring.c:9042
Call Trace:
io_uring_flush+0x47b/0x6e0 fs/io_uring.c:9227
filp_close+0xb4/0x170 fs/open.c:1295
close_files fs/file.c:403 [inline]
put_files_struct fs/file.c:418 [inline]
put_files_struct+0x1cc/0x350 fs/file.c:415
exit_files+0x7e/0xa0 fs/file.c:435
do_exit+0xc22/0x2ae0 kernel/exit.c:820
do_group_exit+0x125/0x310 kernel/exit.c:922
get_signal+0x427/0x20f0 kernel/signal.c:2773
arch_do_signal_or_restart+0x2a8/0x1eb0 arch/x86/kernel/signal.c:811
handle_signal_work kernel/entry/common.c:147 [inline]
exit_to_user_mode_loop kernel/entry/common.c:171 [inline]
exit_to_user_mode_prepare+0x148/0x250 kernel/entry/common.c:201
__syscall_exit_to_user_mode_work kernel/entry/common.c:291 [inline]
syscall_exit_to_user_mode+0x19/0x50 kernel/entry/common.c:302
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Now io_uring_cancel_task_requests() can be called not through file
notes but directly, remove a WARN_ONCE() there that give us false
positives. That check is not very important and we catch it in other
places.
Fixes: 84965ff8a8 ("io_uring: if we see flush on exit, cancel related tasks")
Cc: stable@vger.kernel.org # 5.9+
Reported-by: syzbot+3e3d9bd0c6ce9efbc3ef@syzkaller.appspotmail.com
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The "prefixpath" mount option needs to be ignored
which was missed in the recent conversion to the
new mount API (prefixpath would be set by the mount
helper if mounting a subdirectory of the root of a
share e.g. //server/share/subdir)
Fixes: 24e0a1eff9 ("cifs: switch to new mount api")
Suggested-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
In 24e0a1eff9, the noauto and auto options were missed when migrating
to the new mount API. As a result, users with noauto in their fstab
mount options are now unable to mount cifs filesystems, as they'll
receive an "Unknown parameter" error.
This restores the old behaviour of ignoring noauto and auto if they're
given.
Fixes: 24e0a1eff9 ("cifs: switch to new mount api")
Signed-off-by: Adam Harvey <adam@adamharvey.name>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Overlayfs's volatile option allows the user to bypass all forced sync calls
to the upperdir filesystem. This comes at the cost of safety. We can never
ensure that the user's data is intact, but we can make a best effort to
expose whether or not the data is likely to be in a bad state.
The best way to handle this in the time being is that if an overlayfs's
upperdir experiences an error after a volatile mount occurs, that error
will be returned on fsync, fdatasync, sync, and syncfs. This is
contradictory to the traditional behaviour of VFS which fails the call
once, and only raises an error if a subsequent fsync error has occurred,
and been raised by the filesystem.
One awkward aspect of the patch is that we have to manually set the
superblock's errseq_t after the sync_fs callback as opposed to just
returning an error from syncfs. This is because the call chain looks
something like this:
sys_syncfs ->
sync_filesystem ->
__sync_filesystem ->
/* The return value is ignored here
sb->s_op->sync_fs(sb)
_sync_blockdev
/* Where the VFS fetches the error to raise to userspace */
errseq_check_and_advance
Because of this we call errseq_set every time the sync_fs callback occurs.
Due to the nature of this seen / unseen dichotomy, if the upperdir is an
inconsistent state at the initial mount time, overlayfs will refuse to
mount, as overlayfs cannot get a snapshot of the upperdir's errseq that
will increment on error until the user calls syncfs.
Signed-off-by: Sargun Dhillon <sargun@sargun.me>
Suggested-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Fixes: c86243b090 ("ovl: provide a mount option "volatile"")
Cc: stable@vger.kernel.org
Reviewed-by: Vivek Goyal <vgoyal@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
When inode has no listxattr op of its own (e.g. squashfs) vfs_listxattr
calls the LSM inode_listsecurity hooks to list the xattrs that LSMs will
intercept in inode_getxattr hooks.
When selinux LSM is installed but not initialized, it will list the
security.selinux xattr in inode_listsecurity, but will not intercept it
in inode_getxattr. This results in -ENODATA for a getxattr call for an
xattr returned by listxattr.
This situation was manifested as overlayfs failure to copy up lower
files from squashfs when selinux is built-in but not initialized,
because ovl_copy_xattr() iterates the lower inode xattrs by
vfs_listxattr() and vfs_getxattr().
ovl_copy_xattr() skips copy up of security labels that are indentified by
inode_copy_up_xattr LSM hooks, but it does that after vfs_getxattr().
Since we are not going to copy them, skip vfs_getxattr() of the security
labels.
Reported-by: Michael Labriola <michael.d.labriola@gmail.com>
Tested-by: Michael Labriola <michael.d.labriola@gmail.com>
Link: https://lore.kernel.org/linux-unionfs/2nv9d47zt7.fsf@aldarion.sourceruckus.org/
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
The function ovl_dir_real_file() currently uses the inode lock to serialize
writes to the od->upperfile field.
However, this function will get called by ovl_ioctl_set_flags(), which
utilizes the inode lock too. In this case ovl_dir_real_file() will try to
claim a lock that is owned by a function in its call stack, which won't get
released before ovl_dir_real_file() returns.
Fix by replacing the open coded compare and exchange by an explicit atomic
op.
Fixes: 61536bed21 ("ovl: support [S|G]ETFLAGS and FS[S|G]ETXATTR ioctls for directories")
Cc: stable@vger.kernel.org # v5.10
Reported-by: Icenowy Zheng <icenowy@aosc.io>
Tested-by: Icenowy Zheng <icenowy@aosc.io>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
The vfs_getxattr() in ovl_xattr_set() is used to check whether an xattr
exist on a lower layer file that is to be removed. If the xattr does not
exist, then no need to copy up the file.
This call of vfs_getxattr() wasn't wrapped in credential override, and this
is probably okay. But for consitency wrap this instance as well.
Reported-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Currently there's no way to create an overlay filesystem outside of the
current user namespace. Make sure that if this assumption changes it
doesn't go unnoticed.
Reported-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
The cited commit introduced a serious regression with SATA write speed,
as found by bisecting. This patch reverts this commit, which restores
write speed back to the values observed before this commit.
The performance tests were done on a Helios4 NAS (2nd batch) with 4 HDDs
(WD8003FFBX) using dd (bs=1M count=2000). "Direct" is a test with a
single HDD, the rest are different RAID levels built over the first
partitions of 4 HDDs. Test results are in MB/s, R is read, W is write.
| Direct | RAID0 | RAID10 f2 | RAID10 n2 | RAID6
----------------+--------+-------+-----------+-----------+--------
9011495c94 | R:256 | R:313 | R:276 | R:313 | R:323
(before faulty) | W:254 | W:253 | W:195 | W:204 | W:117
----------------+--------+-------+-----------+-----------+--------
5ff9f19231 | R:257 | R:398 | R:312 | R:344 | R:391
(faulty commit) | W:154 | W:122 | W:67.7 | W:66.6 | W:67.2
----------------+--------+-------+-----------+-----------+--------
5.10.10 | R:256 | R:401 | R:312 | R:356 | R:375
unpatched | W:149 | W:123 | W:64 | W:64.1 | W:61.5
----------------+--------+-------+-----------+-----------+--------
5.10.10 | R:255 | R:396 | R:312 | R:340 | R:393
patched | W:247 | W:274 | W:220 | W:225 | W:121
Applying this patch doesn't hurt read performance, while improves the
write speed by 1.5x - 3.5x (more impact on RAID tests). The write speed
is restored back to the state before the faulty commit, and even a bit
higher in RAID tests (which aren't HDD-bound on this device) - that is
likely related to other optimizations done between the faulty commit and
5.10.10 which also improved the read speed.
Signed-off-by: Maxim Mikityanskiy <maxtram95@gmail.com>
Fixes: 5ff9f19231 ("block: simplify set_init_blocksize")
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jens Axboe <axboe@kernel.dk>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If the tctx inflight number haven't changed because of cancellation,
__io_uring_task_cancel() will continue leaving the task in
TASK_UNINTERRUPTIBLE state, that's not expected by
__io_uring_files_cancel(). Ensure we always call finish_wait() before
retrying.
Cc: stable@vger.kernel.org # 5.9+
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Prior to commit 7c03e2cda4 ("vfs: move cap_convert_nscap() call into
vfs_setxattr()") the translation of nscap->rootid did not take stacked
filesystems (overlayfs and ecryptfs) into account.
That patch fixed the overlay case, but made the ecryptfs case worse.
Restore old the behavior for ecryptfs that existed before the overlayfs
fix. This does not fix ecryptfs's handling of complex user namespace
setups, but it does make sure existing setups don't regress.
Reported-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Tyler Hicks <code@tyhicks.com>
Fixes: 7c03e2cda4 ("vfs: move cap_convert_nscap() call into vfs_setxattr()")
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Tyler Hicks <code@tyhicks.com>
After commit 36e2c7421f ("fs: don't allow splice read/write
without explicit ops") sendfile() could no longer send data
from a real file to a pipe, breaking for example certain cgit
setups (e.g. when running behind fcgiwrap), because in this
case cgit will try to do exactly this: sendfile() to a pipe.
Fix this by using iter_file_splice_write for the splice_write
method of pipes, as suggested by Christoph.
Cc: stable@vger.kernel.org
Fixes: 36e2c7421f ("fs: don't allow splice read/write without explicit ops")
Suggested-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
After a sudden power failure we may end up with a space cache on disk that
is not valid and needs to be rebuilt from scratch.
If that happens, during log replay when we attempt to pin an extent buffer
from a log tree, at btrfs_pin_extent_for_log_replay(), we do not wait for
the space cache to be rebuilt through the call to:
btrfs_cache_block_group(cache, 1);
That is because that only waits for the task (work queue job) that loads
the space cache to change the cache state from BTRFS_CACHE_FAST to any
other value. That is ok when the space cache on disk exists and is valid,
but when the cache is not valid and needs to be rebuilt, it ends up
returning as soon as the cache state changes to BTRFS_CACHE_STARTED (done
at caching_thread()).
So this means that we can end up trying to unpin a range which is not yet
marked as free in the block group. This results in the call to
btrfs_remove_free_space() to return -EINVAL to
btrfs_pin_extent_for_log_replay(), which in turn makes the log replay fail
as well as mounting the filesystem. More specifically the -EINVAL comes
from free_space_cache.c:remove_from_bitmap(), because the requested range
is not marked as free space (ones in the bitmap), we have the following
condition triggered:
static noinline int remove_from_bitmap(struct btrfs_free_space_ctl *ctl,
(...)
if (ret < 0 || search_start != *offset)
return -EINVAL;
(...)
It's the "search_start != *offset" that results in the condition being
evaluated to true.
When this happens we got the following in dmesg/syslog:
[72383.415114] BTRFS: device fsid 32b95b69-0ea9-496a-9f02-3f5a56dc9322 devid 1 transid 1432 /dev/sdb scanned by mount (3816007)
[72383.417837] BTRFS info (device sdb): disk space caching is enabled
[72383.418536] BTRFS info (device sdb): has skinny extents
[72383.423846] BTRFS info (device sdb): start tree-log replay
[72383.426416] BTRFS warning (device sdb): block group 30408704 has wrong amount of free space
[72383.427686] BTRFS warning (device sdb): failed to load free space cache for block group 30408704, rebuilding it now
[72383.454291] BTRFS: error (device sdb) in btrfs_recover_log_trees:6203: errno=-22 unknown (Failed to pin buffers while recovering log root tree.)
[72383.456725] BTRFS: error (device sdb) in btrfs_replay_log:2253: errno=-22 unknown (Failed to recover log tree)
[72383.460241] BTRFS error (device sdb): open_ctree failed
We also mark the range for the extent buffer in the excluded extents io
tree. That is fine when the space cache is valid on disk and we can load
it, in which case it causes no problems.
However, for the case where we need to rebuild the space cache, because it
is either invalid or it is missing, having the extent buffer range marked
in the excluded extents io tree leads to a -EINVAL failure from the call
to btrfs_remove_free_space(), resulting in the log replay and mount to
fail. This is because by having the range marked in the excluded extents
io tree, the caching thread ends up never adding the range of the extent
buffer as free space in the block group since the calls to
add_new_free_space(), called from load_extent_tree_free(), filter out any
ranges that are marked as excluded extents.
So fix this by making sure that during log replay we wait for the caching
task to finish completely when we need to rebuild a space cache, and also
drop the need to mark the extent buffer range in the excluded extents io
tree, as well as clearing ranges from that tree at
btrfs_finish_extent_commit().
This started to happen with some frequency on large filesystems having
block groups with a lot of fragmentation since the recent commit
e747853cae ("btrfs: load free space cache asynchronously"), but in
fact the issue has been there for years, it was just much less likely
to happen.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This effectively reverts commit d5c8238849 ("btrfs: convert
data_seqcount to seqcount_mutex_t").
While running fstests on 32 bits test box, many tests failed because of
warnings in dmesg. One of those warnings (btrfs/003):
[66.441317] WARNING: CPU: 6 PID: 9251 at include/linux/seqlock.h:279 btrfs_remove_chunk+0x58b/0x7b0 [btrfs]
[66.441446] CPU: 6 PID: 9251 Comm: btrfs Tainted: G O 5.11.0-rc4-custom+ #5
[66.441449] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ArchLinux 1.14.0-1 04/01/2014
[66.441451] EIP: btrfs_remove_chunk+0x58b/0x7b0 [btrfs]
[66.441472] EAX: 00000000 EBX: 00000001 ECX: c576070c EDX: c6b15803
[66.441475] ESI: 10000000 EDI: 00000000 EBP: c56fbcfc ESP: c56fbc70
[66.441477] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00010246
[66.441481] CR0: 80050033 CR2: 05c8da20 CR3: 04b20000 CR4: 00350ed0
[66.441485] Call Trace:
[66.441510] btrfs_relocate_chunk+0xb1/0x100 [btrfs]
[66.441529] ? btrfs_lookup_block_group+0x17/0x20 [btrfs]
[66.441562] btrfs_balance+0x8ed/0x13b0 [btrfs]
[66.441586] ? btrfs_ioctl_balance+0x333/0x3c0 [btrfs]
[66.441619] ? __this_cpu_preempt_check+0xf/0x11
[66.441643] btrfs_ioctl_balance+0x333/0x3c0 [btrfs]
[66.441664] ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
[66.441683] btrfs_ioctl+0x414/0x2ae0 [btrfs]
[66.441700] ? __lock_acquire+0x35f/0x2650
[66.441717] ? lockdep_hardirqs_on+0x87/0x120
[66.441720] ? lockdep_hardirqs_on_prepare+0xd0/0x1e0
[66.441724] ? call_rcu+0x2d3/0x530
[66.441731] ? __might_fault+0x41/0x90
[66.441736] ? kvm_sched_clock_read+0x15/0x50
[66.441740] ? sched_clock+0x8/0x10
[66.441745] ? sched_clock_cpu+0x13/0x180
[66.441750] ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
[66.441750] ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
[66.441768] __ia32_sys_ioctl+0x165/0x8a0
[66.441773] ? __this_cpu_preempt_check+0xf/0x11
[66.441785] ? __might_fault+0x89/0x90
[66.441791] __do_fast_syscall_32+0x54/0x80
[66.441796] do_fast_syscall_32+0x32/0x70
[66.441801] do_SYSENTER_32+0x15/0x20
[66.441805] entry_SYSENTER_32+0x9f/0xf2
[66.441808] EIP: 0xab7b5549
[66.441814] EAX: ffffffda EBX: 00000003 ECX: c4009420 EDX: bfa91f5c
[66.441816] ESI: 00000003 EDI: 00000001 EBP: 00000000 ESP: bfa91e98
[66.441818] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b EFLAGS: 00000292
[66.441833] irq event stamp: 42579
[66.441835] hardirqs last enabled at (42585): [<c60eb065>] console_unlock+0x495/0x590
[66.441838] hardirqs last disabled at (42590): [<c60eafd5>] console_unlock+0x405/0x590
[66.441840] softirqs last enabled at (41698): [<c601b76c>] call_on_stack+0x1c/0x60
[66.441843] softirqs last disabled at (41681): [<c601b76c>] call_on_stack+0x1c/0x60
========================================================================
btrfs_remove_chunk+0x58b/0x7b0:
__seqprop_mutex_assert at linux/./include/linux/seqlock.h:279
(inlined by) btrfs_device_set_bytes_used at linux/fs/btrfs/volumes.h:212
(inlined by) btrfs_remove_chunk at linux/fs/btrfs/volumes.c:2994
========================================================================
The warning is produced by lockdep_assert_held() in
__seqprop_mutex_assert() if CONFIG_LOCKDEP is enabled.
And "olumes.c:2994 is btrfs_device_set_bytes_used() with mutex lock
fs_info->chunk_mutex held already.
After adding some debug prints, the cause was found that many
__alloc_device() are called with NULL @fs_info (during scanning ioctl).
Inside the function, btrfs_device_data_ordered_init() is expanded to
seqcount_mutex_init(). In this scenario, its second
parameter info->chunk_mutex is &NULL->chunk_mutex which equals
to offsetof(struct btrfs_fs_info, chunk_mutex) unexpectedly. Thus,
seqcount_mutex_init() is called in wrong way. And later
btrfs_device_get/set helpers trigger lockdep warnings.
The device and filesystem object lifetimes are different and we'd have
to synchronize initialization of the btrfs_device::data_seqcount with
the fs_info, possibly using some additional synchronization. It would
still not prevent concurrent access to the seqcount lock when it's used
for read and initialization.
Commit d5c8238849 ("btrfs: convert data_seqcount to seqcount_mutex_t")
does not mention a particular problem being fixed so revert should not
cause any harm and we'll get the lockdep warning fixed.
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=210139
Reported-by: Erhard F <erhard_f@mailbox.org>
Fixes: d5c8238849 ("btrfs: convert data_seqcount to seqcount_mutex_t")
CC: stable@vger.kernel.org # 5.10
CC: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Su Yue <l@damenly.su>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
While running btrfs/011 in a loop I would often ASSERT() while trying to
add a new free space entry that already existed, or get an EEXIST while
adding a new block to the extent tree, which is another indication of
double allocation.
This occurs because when we do the free space tree population, we create
the new root and then populate the tree and commit the transaction.
The problem is when you create a new root, the root node and commit root
node are the same. During this initial transaction commit we will run
all of the delayed refs that were paused during the free space tree
generation, and thus begin to cache block groups. While caching block
groups the caching thread will be reading from the main root for the
free space tree, so as we make allocations we'll be changing the free
space tree, which can cause us to add the same range twice which results
in either the ASSERT(ret != -EEXIST); in __btrfs_add_free_space, or in a
variety of different errors when running delayed refs because of a
double allocation.
Fix this by marking the fs_info as unsafe to load the free space tree,
and fall back on the old slow method. We could be smarter than this,
for example caching the block group while we're populating the free
space tree, but since this is a serious problem I've opted for the
simplest solution.
CC: stable@vger.kernel.org # 4.9+
Fixes: a5ed918285 ("Btrfs: implement the free space B-tree")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If a layoutget ends up being reordered w.r.t. a layoutreturn, e.g. due
to a layoutget-on-open not knowing a priori which file to lock, then we
must assume the layout is no longer being considered valid state by the
server.
Incrementally improve our ability to reject such states by using the
cached old stateid in conjunction with the plh_barrier to try to
identify them.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
When we're scheduling a layoutreturn, we need to ignore any further
incoming layouts with sequence ids that are going to be affected by the
layout return.
Fixes: 44ea8dfce0 ("NFS/pnfs: Reference the layout cred in pnfs_prepare_layoutreturn()")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
If the server returns a new stateid that does not match the one in our
cache, then try to return the one we hold instead of just invalidating
it on the client side. This ensures that both client and server will
agree that the stateid is invalid.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
If the server returns a new stateid that does not match the one in our
cache, then pnfs_layout_process() will leak the layout segments returned
by pnfs_mark_layout_stateid_invalid().
Fixes: 9888d837f3 ("pNFS: Force a retry of LAYOUTGET if the stateid doesn't match our cache")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
This normally doesn't cause any extra harm, but it does mean that we'll
increment the eventfd notification count, if one has been registered
with the ring. This can confuse applications, when they see more
notifications on the eventfd side than are available in the ring.
Do the nice thing and only increment this count, if we actually posted
(or even overflowed) events.
Reported-and-tested-by: Dan Melnic <dmm@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ensure we match tasks that belong to a dead or dying task as well, as we
need to reap those in addition to those belonging to the exiting task.
Cc: stable@vger.kernel.org # 5.9+
Reported-by: Josef Grieb <josef.grieb@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmANwKAQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgprSsD/9Gp1dLmjYv0EhBAKSqC3UaXCLIZujontwb
OWEM03ssjQrg0s5MNEAvzmFdScJequVMoGay6OeLf2jXA0xl9jr71kThVuwziZAK
otspuMV2YrgL8FHclg52Uq3b+RKJGyCjNE2HlytRVGKwp501c6rcOESkeiiSJ/gE
92lkJCATkle8CLfiRp+zf7owbiEFvJrztpa4CLgNCLlQyYiqMK0Vnct2J29NLoXt
fWxVOx8rTpQy1P0y+YfsMh4uxcb7MzI2bG2Cqm7G0VACe9XtzaXorTxuKTC1sQ4d
Q1+h1eYIlTITcQ3WwZ1FZzF/OPidJHd/BKQRlP4qfCwt5f4SN4/hH9Mxhg9K3Uq2
c51wl0RD8FPI+S8XOe0btKopnZmhIqci76jKSFkbYAxm3XzVgiBOl4huD6e5JrLl
3VpxTNsATmMgatdAvxAb5DzoeUUVMmo4sb74mtpavXPg/i6E64joOxfcozLswXgc
rwa83GBigXOLythGTemeySG9GlCl0MCU5MioFRO2men5+rF2Lq2MfY51GEnP+if+
8O27/PsN+d9vuCwWHqhyDFvUIF2s6KqvDKoj3X54xKf5PpB3mISUYlamyvoVZzMw
t3OBalUq51JAxYhxTiF7RMYKEz7pBQ2aTBy1YV3sXLEcSzgt0iskWDi8mYB5/2lc
oIiSTapZDA==
=Vbhq
-----END PGP SIGNATURE-----
Merge tag 'io_uring-5.11-2021-01-24' of git://git.kernel.dk/linux-block
Pull io_uring fixes from Jens Axboe:
"Still need a final cancelation fix that isn't quite done done,
expected in the next day or two. That said, this contains:
- Wakeup fix for IOPOLL requests
- SQPOLL split close op handling fix
- Ensure that any use of io_uring fd itself is marked as inflight
- Short non-regular file read fix (Pavel)
- Fix up bad false positive warning (Pavel)
- SQPOLL fixes (Pavel)
- In-flight removal fix (Pavel)"
* tag 'io_uring-5.11-2021-01-24' of git://git.kernel.dk/linux-block:
io_uring: account io_uring internal files as REQ_F_INFLIGHT
io_uring: fix sleeping under spin in __io_clean_op
io_uring: fix short read retries for non-reg files
io_uring: fix SQPOLL IORING_OP_CLOSE cancelation state
io_uring: fix skipping disabling sqo on exec
io_uring: fix uring_flush in exit_files() warning
io_uring: fix false positive sqo warning on flush
io_uring: iopoll requests should also wake task ->in_idle state
Merge misc fixes from Andrew Morton:
"18 patches.
Subsystems affected by this patch series: mm (pagealloc, memcg, kasan,
memory-failure, and highmem), ubsan, proc, and MAINTAINERS"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
MAINTAINERS: add a couple more files to the Clang/LLVM section
proc_sysctl: fix oops caused by incorrect command parameters
powerpc/mm/highmem: use __set_pte_at() for kmap_local()
mips/mm/highmem: use set_pte() for kmap_local()
mm/highmem: prepare for overriding set_pte_at()
sparc/mm/highmem: flush cache and TLB
mm: fix page reference leak in soft_offline_page()
ubsan: disable unsigned-overflow check for i386
kasan, mm: fix resetting page_alloc tags for HW_TAGS
kasan, mm: fix conflicts with init_on_alloc/free
kasan: fix HW_TAGS boot parameters
kasan: fix incorrect arguments passing in kasan_add_zero_shadow
kasan: fix unaligned address is unhandled in kasan_remove_zero_shadow
mm: fix numa stats for thp migration
mm: memcg: fix memcg file_dirty numa stat
mm: memcg/slab: optimize objcg stock draining
mm: fix initialization of struct page for holes in memory layout
x86/setup: don't remove E820_TYPE_RAM for pfn 0
Here are some small driver core fixes for 5.11-rc5 that resolve some
reported problems:
- revert of a -rc1 patch that was causing problems with some
machines
- device link device name collision problem fix (busses only
have to name devices unique to their bus, not unique to all
busses)
- kernfs splice bugfixes to resolve firmware loading problems
for Qualcomm systems.
- other tiny driver core fixes for minor issues reported.
All of these have been in linux-next with no reported problems.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCYA1qGw8cZ3JlZ0Brcm9h
aC5jb20ACgkQMUfUDdst+ymd2ACgzVEpWJddtLNz+9guU9kAIIPcNboAn2GreCle
vgNkgCapi2ZjYtWBk8Cl
=YNIw
-----END PGP SIGNATURE-----
Merge tag 'driver-core-5.11-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core fixes from Greg KH:
"Here are some small driver core fixes for 5.11-rc5 that resolve some
reported problems:
- revert of a -rc1 patch that was causing problems with some machines
- device link device name collision problem fix (busses only have to
name devices unique to their bus, not unique to all busses)
- kernfs splice bugfixes to resolve firmware loading problems for
Qualcomm systems.
- other tiny driver core fixes for minor issues reported.
All of these have been in linux-next with no reported problems"
* tag 'driver-core-5.11-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
driver core: Fix device link device name collision
driver core: Extend device_is_dependent()
kernfs: wire up ->splice_read and ->splice_write
kernfs: implement ->write_iter
kernfs: implement ->read_iter
Revert "driver core: Reorder devices on successful probe"
Driver core: platform: Add extra error check in devm_platform_get_irqs_affinity()
drivers core: Free dma_range_map when driver probe failed
The process_sysctl_arg() does not check whether val is empty before
invoking strlen(val). If the command line parameter () is incorrectly
configured and val is empty, oops is triggered.
For example:
"hung_task_panic=1" is incorrectly written as "hung_task_panic", oops is
triggered. The call stack is as follows:
Kernel command line: .... hung_task_panic
......
Call trace:
__pi_strlen+0x10/0x98
parse_args+0x278/0x344
do_sysctl_args+0x8c/0xfc
kernel_init+0x5c/0xf4
ret_from_fork+0x10/0x30
To fix it, check whether "val" is empty when "phram" is a sysctl field.
Error codes are returned in the failure branch, and error logs are
generated by parse_args().
Link: https://lkml.kernel.org/r/20210118133029.28580-1-nixiaoming@huawei.com
Fixes: 3db978d480 ("kernel/sysctl: support setting sysctl parameters from kernel command line")
Signed-off-by: Xiaoming Ni <nixiaoming@huawei.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Heiner Kallweit <hkallweit1@gmail.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: <stable@vger.kernel.org> [5.8+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmAM+XcACgkQiiy9cAdy
T1E44gwAnonpHv2eDi6Lo6/Ev3hP7kETaIcYc3Jql61gz0rn880cNh67C5osdxVY
hdaKycuwQOBQbHcrt6kOfpOOcnjVXb92L/THzUJcHgUQx4bUFHNhuYt8NTrQFsnd
ke+nLVbRcAiQRVAN46d9y2YQBA64SGUrbjjGyFD0ew13ORpsGRVBQ1UjEXzUvObV
d0rFrlVoAIZA3RAj5uxBeHnyIy3pZaviEhg0ZlL4QEqEDhcKh82kWcX4AdZFOpwI
qrgJ1/TtJPj4bhPPCsWXwkSQSF9JETmigRCn03fRyykvdt2f4xIgvVyaqhWqmpvv
BZL8BKzutcFw7c5FTgNvjxAH5G2xGpefdPKE+ve0AwGnL0yrWIjk6tbpCPs9BqEe
nD2Pxgb+r9aCtXYYYgkZ59YOF50gdtJF7ffFqMWZ+OGckFSHOpLeFP8IW9Q102b9
VRkMTZQSd0E60ClVvuaEFIVHX52bJxM0kgdtH/2DVGdARllA4QjZuQdXZuNughYi
zKckmEyQ
=wHrR
-----END PGP SIGNATURE-----
Merge tag '5.11-rc4-smb3' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs fixes from Steve French:
"An important signal handling patch for stable, and two small cleanup
patches"
* tag '5.11-rc4-smb3' of git://git.samba.org/sfrench/cifs-2.6:
cifs: do not fail __smb_send_rqst if non-fatal signals are pending
fs/cifs: Simplify bool comparison.
fs/cifs: Assign boolean values to a bool variable
We need to actively cancel anything that introduces a potential circular
loop, where io_uring holds a reference to itself. If the file in question
is an io_uring file, then add the request to the inflight list.
Cc: stable@vger.kernel.org # 5.9+
Signed-off-by: Jens Axboe <axboe@kernel.dk>
RHBZ 1848178
The original intent of returning an error in this function
in the patch:
"CIFS: Mask off signals when sending SMB packets"
was to avoid interrupting packet send in the middle of
sending the data (and thus breaking an SMB connection),
but we also don't want to fail the request for non-fatal
signals even before we have had a chance to try to
send it (the reported problem could be reproduced e.g.
by exiting a child process when the parent process was in
the midst of calling futimens to update a file's timestamps).
In addition, since the signal may remain pending when we enter the
sending loop, we may end up not sending the whole packet before
TCP buffers become full. In this case the code returns -EINTR
but what we need here is to return -ERESTARTSYS instead to
allow system calls to be restarted.
Fixes: b30c74c73c ("CIFS: Mask off signals when sending SMB packets")
Cc: stable@vger.kernel.org # v5.1+
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
prompted by the fact that a bunch of code was moved in this cycle.
-----BEGIN PGP SIGNATURE-----
iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAmAK9skTHGlkcnlvbW92
QGdtYWlsLmNvbQAKCRBKf944AhHziyGwB/9rZnaaYR6Frqc2tzE5vbVtjAxvhftn
pGDr8laOHiK5jnKR+ljNlPAe07TSEK+qVidulX05moujKrZeIrDUJZnEpScrssZO
o7Tm99dHziqJc10liembtSZzB3LzGJyW1hgavC5Vjo7JW+EZ+YR9x2pFKCO7Hz/M
QlT6kQmXZLnsLB2OieAyC9Yb7IMD1wfiOHHvOZDeFpIn49Z8reFahI+dSwwK/uOv
UouxZKKuaikSTvzp8WmTiuCpsUfBMOaDy5/pWLfBS+/116K2aieJmzSjUb2MZwDT
cLGhzrkyeCkBeFO5vhnob3n9KqWXN03I9rPB25StcrHCRYcHa3z/D/4k
=9COg
-----END PGP SIGNATURE-----
Merge tag 'ceph-for-5.11-rc5' of git://github.com/ceph/ceph-client
Pull ceph fixes from Ilya Dryomov:
"A patch to zero out sensitive cryptographic data and two minor
cleanups prompted by the fact that a bunch of code was moved in this
cycle"
* tag 'ceph-for-5.11-rc5' of git://github.com/ceph/ceph-client:
libceph: fix "Boolean result is used in bitwise operation" warning
libceph, ceph: disambiguate ceph_connection_operations handlers
libceph: zero out session key and connection secret
Sockets and other non-regular files may actually expect short reads to
happen, don't retry reads for them. Because non-reg files don't set
FMODE_BUF_RASYNC and so it won't do second/retry do_read, we can filter
out those cases after first do_read() attempt with ret>0.
Cc: stable@vger.kernel.org # 5.9+
Suggested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
IORING_OP_CLOSE is special in terms of cancelation, since it has an
intermediate state where we've removed the file descriptor but hasn't
closed the file yet. For that reason, it's currently marked with
IO_WQ_WORK_NO_CANCEL to prevent cancelation. This ensures that the op
is always run even if canceled, to prevent leaving us with a live file
but an fd that is gone. However, with SQPOLL, since a cancel request
doesn't carry any resources on behalf of the request being canceled, if
we cancel before any of the close op has been run, we can end up with
io-wq not having the ->files assigned. This can result in the following
oops reported by Joseph:
BUG: kernel NULL pointer dereference, address: 00000000000000d8
PGD 800000010b76f067 P4D 800000010b76f067 PUD 10b462067 PMD 0
Oops: 0000 [#1] SMP PTI
CPU: 1 PID: 1788 Comm: io_uring-sq Not tainted 5.11.0-rc4 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
RIP: 0010:__lock_acquire+0x19d/0x18c0
Code: 00 00 8b 1d fd 56 dd 08 85 db 0f 85 43 05 00 00 48 c7 c6 98 7b 95 82 48 c7 c7 57 96 93 82 e8 9a bc f5 ff 0f 0b e9 2b 05 00 00 <48> 81 3f c0 ca 67 8a b8 00 00 00 00 41 0f 45 c0 89 04 24 e9 81 fe
RSP: 0018:ffffc90001933828 EFLAGS: 00010002
RAX: 0000000000000001 RBX: 0000000000000001 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00000000000000d8
RBP: 0000000000000246 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
R13: 0000000000000000 R14: ffff888106e8a140 R15: 00000000000000d8
FS: 0000000000000000(0000) GS:ffff88813bd00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000000000d8 CR3: 0000000106efa004 CR4: 00000000003706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
lock_acquire+0x31a/0x440
? close_fd_get_file+0x39/0x160
? __lock_acquire+0x647/0x18c0
_raw_spin_lock+0x2c/0x40
? close_fd_get_file+0x39/0x160
close_fd_get_file+0x39/0x160
io_issue_sqe+0x1334/0x14e0
? lock_acquire+0x31a/0x440
? __io_free_req+0xcf/0x2e0
? __io_free_req+0x175/0x2e0
? find_held_lock+0x28/0xb0
? io_wq_submit_work+0x7f/0x240
io_wq_submit_work+0x7f/0x240
io_wq_cancel_cb+0x161/0x580
? io_wqe_wake_worker+0x114/0x360
? io_uring_get_socket+0x40/0x40
io_async_find_and_cancel+0x3b/0x140
io_issue_sqe+0xbe1/0x14e0
? __lock_acquire+0x647/0x18c0
? __io_queue_sqe+0x10b/0x5f0
__io_queue_sqe+0x10b/0x5f0
? io_req_prep+0xdb/0x1150
? mark_held_locks+0x6d/0xb0
? mark_held_locks+0x6d/0xb0
? io_queue_sqe+0x235/0x4b0
io_queue_sqe+0x235/0x4b0
io_submit_sqes+0xd7e/0x12a0
? _raw_spin_unlock_irq+0x24/0x30
? io_sq_thread+0x3ae/0x940
io_sq_thread+0x207/0x940
? do_wait_intr_irq+0xc0/0xc0
? __ia32_sys_io_uring_enter+0x650/0x650
kthread+0x134/0x180
? kthread_create_worker_on_cpu+0x90/0x90
ret_from_fork+0x1f/0x30
Fix this by moving the IO_WQ_WORK_NO_CANCEL until _after_ we've modified
the fdtable. Canceling before this point is totally fine, and running
it in the io-wq context _after_ that point is also fine.
For 5.12, we'll handle this internally and get rid of the no-cancel
flag, as IORING_OP_CLOSE is the only user of it.
Cc: stable@vger.kernel.org
Fixes: b5dba59e0c ("io_uring: add support for IORING_OP_CLOSE")
Reported-by: "Abaci <abaci@linux.alibaba.com>"
Reviewed-and-tested-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>