mirror of
https://mirrors.bfsu.edu.cn/git/linux.git
synced 2024-12-13 22:14:20 +08:00
ecae0bd517
included in this merge do the following: - Kemeng Shi has contributed some compation maintenance work in the series "Fixes and cleanups to compaction". - Joel Fernandes has a patchset ("Optimize mremap during mutual alignment within PMD") which fixes an obscure issue with mremap()'s pagetable handling during a subsequent exec(), based upon an implementation which Linus suggested. - More DAMON/DAMOS maintenance and feature work from SeongJae Park i the following patch series: mm/damon: misc fixups for documents, comments and its tracepoint mm/damon: add a tracepoint for damos apply target regions mm/damon: provide pseudo-moving sum based access rate mm/damon: implement DAMOS apply intervals mm/damon/core-test: Fix memory leaks in core-test mm/damon/sysfs-schemes: Do DAMOS tried regions update for only one apply interval - In the series "Do not try to access unaccepted memory" Adrian Hunter provides some fixups for the recently-added "unaccepted memory' feature. To increase the feature's checking coverage. "Plug a few gaps where RAM is exposed without checking if it is unaccepted memory". - In the series "cleanups for lockless slab shrink" Qi Zheng has done some maintenance work which is preparation for the lockless slab shrinking code. - Qi Zheng has redone the earlier (and reverted) attempt to make slab shrinking lockless in the series "use refcount+RCU method to implement lockless slab shrink". - David Hildenbrand contributes some maintenance work for the rmap code in the series "Anon rmap cleanups". - Kefeng Wang does more folio conversions and some maintenance work in the migration code. Series "mm: migrate: more folio conversion and unification". - Matthew Wilcox has fixed an issue in the buffer_head code which was causing long stalls under some heavy memory/IO loads. Some cleanups were added on the way. Series "Add and use bdev_getblk()". - In the series "Use nth_page() in place of direct struct page manipulation" Zi Yan has fixed a potential issue with the direct manipulation of hugetlb page frames. - In the series "mm: hugetlb: Skip initialization of gigantic tail struct pages if freed by HVO" has improved our handling of gigantic pages in the hugetlb vmmemmep optimizaton code. This provides significant boot time improvements when significant amounts of gigantic pages are in use. - Matthew Wilcox has sent the series "Small hugetlb cleanups" - code rationalization and folio conversions in the hugetlb code. - Yin Fengwei has improved mlock()'s handling of large folios in the series "support large folio for mlock" - In the series "Expose swapcache stat for memcg v1" Liu Shixin has added statistics for memcg v1 users which are available (and useful) under memcg v2. - Florent Revest has enhanced the MDWE (Memory-Deny-Write-Executable) prctl so that userspace may direct the kernel to not automatically propagate the denial to child processes. The series is named "MDWE without inheritance". - Kefeng Wang has provided the series "mm: convert numa balancing functions to use a folio" which does what it says. - In the series "mm/ksm: add fork-exec support for prctl" Stefan Roesch makes is possible for a process to propagate KSM treatment across exec(). - Huang Ying has enhanced memory tiering's calculation of memory distances. This is used to permit the dax/kmem driver to use "high bandwidth memory" in addition to Optane Data Center Persistent Memory Modules (DCPMM). The series is named "memory tiering: calculate abstract distance based on ACPI HMAT" - In the series "Smart scanning mode for KSM" Stefan Roesch has optimized KSM by teaching it to retain and use some historical information from previous scans. - Yosry Ahmed has fixed some inconsistencies in memcg statistics in the series "mm: memcg: fix tracking of pending stats updates values". - In the series "Implement IOCTL to get and optionally clear info about PTEs" Peter Xu has added an ioctl to /proc/<pid>/pagemap which permits us to atomically read-then-clear page softdirty state. This is mainly used by CRIU. - Hugh Dickins contributed the series "shmem,tmpfs: general maintenance" - a bunch of relatively minor maintenance tweaks to this code. - Matthew Wilcox has increased the use of the VMA lock over file-backed page faults in the series "Handle more faults under the VMA lock". Some rationalizations of the fault path became possible as a result. - In the series "mm/rmap: convert page_move_anon_rmap() to folio_move_anon_rmap()" David Hildenbrand has implemented some cleanups and folio conversions. - In the series "various improvements to the GUP interface" Lorenzo Stoakes has simplified and improved the GUP interface with an eye to providing groundwork for future improvements. - Andrey Konovalov has sent along the series "kasan: assorted fixes and improvements" which does those things. - Some page allocator maintenance work from Kemeng Shi in the series "Two minor cleanups to break_down_buddy_pages". - In thes series "New selftest for mm" Breno Leitao has developed another MM self test which tickles a race we had between madvise() and page faults. - In the series "Add folio_end_read" Matthew Wilcox provides cleanups and an optimization to the core pagecache code. - Nhat Pham has added memcg accounting for hugetlb memory in the series "hugetlb memcg accounting". - Cleanups and rationalizations to the pagemap code from Lorenzo Stoakes, in the series "Abstract vma_merge() and split_vma()". - Audra Mitchell has fixed issues in the procfs page_owner code's new timestamping feature which was causing some misbehaviours. In the series "Fix page_owner's use of free timestamps". - Lorenzo Stoakes has fixed the handling of new mappings of sealed files in the series "permit write-sealed memfd read-only shared mappings". - Mike Kravetz has optimized the hugetlb vmemmap optimization in the series "Batch hugetlb vmemmap modification operations". - Some buffer_head folio conversions and cleanups from Matthew Wilcox in the series "Finish the create_empty_buffers() transition". - As a page allocator performance optimization Huang Ying has added automatic tuning to the allocator's per-cpu-pages feature, in the series "mm: PCP high auto-tuning". - Roman Gushchin has contributed the patchset "mm: improve performance of accounted kernel memory allocations" which improves their performance by ~30% as measured by a micro-benchmark. - folio conversions from Kefeng Wang in the series "mm: convert page cpupid functions to folios". - Some kmemleak fixups in Liu Shixin's series "Some bugfix about kmemleak". - Qi Zheng has improved our handling of memoryless nodes by keeping them off the allocation fallback list. This is done in the series "handle memoryless nodes more appropriately". - khugepaged conversions from Vishal Moola in the series "Some khugepaged folio conversions". -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZULEMwAKCRDdBJ7gKXxA jhQHAQCYpD3g849x69DmHnHWHm/EHQLvQmRMDeYZI+nx/sCJOwEAw4AKg0Oemv9y FgeUPAD1oasg6CP+INZvCj34waNxwAc= =E+Y4 -----END PGP SIGNATURE----- Merge tag 'mm-stable-2023-11-01-14-33' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: "Many singleton patches against the MM code. The patch series which are included in this merge do the following: - Kemeng Shi has contributed some compation maintenance work in the series 'Fixes and cleanups to compaction' - Joel Fernandes has a patchset ('Optimize mremap during mutual alignment within PMD') which fixes an obscure issue with mremap()'s pagetable handling during a subsequent exec(), based upon an implementation which Linus suggested - More DAMON/DAMOS maintenance and feature work from SeongJae Park i the following patch series: mm/damon: misc fixups for documents, comments and its tracepoint mm/damon: add a tracepoint for damos apply target regions mm/damon: provide pseudo-moving sum based access rate mm/damon: implement DAMOS apply intervals mm/damon/core-test: Fix memory leaks in core-test mm/damon/sysfs-schemes: Do DAMOS tried regions update for only one apply interval - In the series 'Do not try to access unaccepted memory' Adrian Hunter provides some fixups for the recently-added 'unaccepted memory' feature. To increase the feature's checking coverage. 'Plug a few gaps where RAM is exposed without checking if it is unaccepted memory' - In the series 'cleanups for lockless slab shrink' Qi Zheng has done some maintenance work which is preparation for the lockless slab shrinking code - Qi Zheng has redone the earlier (and reverted) attempt to make slab shrinking lockless in the series 'use refcount+RCU method to implement lockless slab shrink' - David Hildenbrand contributes some maintenance work for the rmap code in the series 'Anon rmap cleanups' - Kefeng Wang does more folio conversions and some maintenance work in the migration code. Series 'mm: migrate: more folio conversion and unification' - Matthew Wilcox has fixed an issue in the buffer_head code which was causing long stalls under some heavy memory/IO loads. Some cleanups were added on the way. Series 'Add and use bdev_getblk()' - In the series 'Use nth_page() in place of direct struct page manipulation' Zi Yan has fixed a potential issue with the direct manipulation of hugetlb page frames - In the series 'mm: hugetlb: Skip initialization of gigantic tail struct pages if freed by HVO' has improved our handling of gigantic pages in the hugetlb vmmemmep optimizaton code. This provides significant boot time improvements when significant amounts of gigantic pages are in use - Matthew Wilcox has sent the series 'Small hugetlb cleanups' - code rationalization and folio conversions in the hugetlb code - Yin Fengwei has improved mlock()'s handling of large folios in the series 'support large folio for mlock' - In the series 'Expose swapcache stat for memcg v1' Liu Shixin has added statistics for memcg v1 users which are available (and useful) under memcg v2 - Florent Revest has enhanced the MDWE (Memory-Deny-Write-Executable) prctl so that userspace may direct the kernel to not automatically propagate the denial to child processes. The series is named 'MDWE without inheritance' - Kefeng Wang has provided the series 'mm: convert numa balancing functions to use a folio' which does what it says - In the series 'mm/ksm: add fork-exec support for prctl' Stefan Roesch makes is possible for a process to propagate KSM treatment across exec() - Huang Ying has enhanced memory tiering's calculation of memory distances. This is used to permit the dax/kmem driver to use 'high bandwidth memory' in addition to Optane Data Center Persistent Memory Modules (DCPMM). The series is named 'memory tiering: calculate abstract distance based on ACPI HMAT' - In the series 'Smart scanning mode for KSM' Stefan Roesch has optimized KSM by teaching it to retain and use some historical information from previous scans - Yosry Ahmed has fixed some inconsistencies in memcg statistics in the series 'mm: memcg: fix tracking of pending stats updates values' - In the series 'Implement IOCTL to get and optionally clear info about PTEs' Peter Xu has added an ioctl to /proc/<pid>/pagemap which permits us to atomically read-then-clear page softdirty state. This is mainly used by CRIU - Hugh Dickins contributed the series 'shmem,tmpfs: general maintenance', a bunch of relatively minor maintenance tweaks to this code - Matthew Wilcox has increased the use of the VMA lock over file-backed page faults in the series 'Handle more faults under the VMA lock'. Some rationalizations of the fault path became possible as a result - In the series 'mm/rmap: convert page_move_anon_rmap() to folio_move_anon_rmap()' David Hildenbrand has implemented some cleanups and folio conversions - In the series 'various improvements to the GUP interface' Lorenzo Stoakes has simplified and improved the GUP interface with an eye to providing groundwork for future improvements - Andrey Konovalov has sent along the series 'kasan: assorted fixes and improvements' which does those things - Some page allocator maintenance work from Kemeng Shi in the series 'Two minor cleanups to break_down_buddy_pages' - In thes series 'New selftest for mm' Breno Leitao has developed another MM self test which tickles a race we had between madvise() and page faults - In the series 'Add folio_end_read' Matthew Wilcox provides cleanups and an optimization to the core pagecache code - Nhat Pham has added memcg accounting for hugetlb memory in the series 'hugetlb memcg accounting' - Cleanups and rationalizations to the pagemap code from Lorenzo Stoakes, in the series 'Abstract vma_merge() and split_vma()' - Audra Mitchell has fixed issues in the procfs page_owner code's new timestamping feature which was causing some misbehaviours. In the series 'Fix page_owner's use of free timestamps' - Lorenzo Stoakes has fixed the handling of new mappings of sealed files in the series 'permit write-sealed memfd read-only shared mappings' - Mike Kravetz has optimized the hugetlb vmemmap optimization in the series 'Batch hugetlb vmemmap modification operations' - Some buffer_head folio conversions and cleanups from Matthew Wilcox in the series 'Finish the create_empty_buffers() transition' - As a page allocator performance optimization Huang Ying has added automatic tuning to the allocator's per-cpu-pages feature, in the series 'mm: PCP high auto-tuning' - Roman Gushchin has contributed the patchset 'mm: improve performance of accounted kernel memory allocations' which improves their performance by ~30% as measured by a micro-benchmark - folio conversions from Kefeng Wang in the series 'mm: convert page cpupid functions to folios' - Some kmemleak fixups in Liu Shixin's series 'Some bugfix about kmemleak' - Qi Zheng has improved our handling of memoryless nodes by keeping them off the allocation fallback list. This is done in the series 'handle memoryless nodes more appropriately' - khugepaged conversions from Vishal Moola in the series 'Some khugepaged folio conversions'" [ bcachefs conflicts with the dynamically allocated shrinkers have been resolved as per Stephen Rothwell in https://lore.kernel.org/all/20230913093553.4290421e@canb.auug.org.au/ with help from Qi Zheng. The clone3 test filtering conflict was half-arsed by yours truly ] * tag 'mm-stable-2023-11-01-14-33' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (406 commits) mm/damon/sysfs: update monitoring target regions for online input commit mm/damon/sysfs: remove requested targets when online-commit inputs selftests: add a sanity check for zswap Documentation: maple_tree: fix word spelling error mm/vmalloc: fix the unchecked dereference warning in vread_iter() zswap: export compression failure stats Documentation: ubsan: drop "the" from article title mempolicy: migration attempt to match interleave nodes mempolicy: mmap_lock is not needed while migrating folios mempolicy: alloc_pages_mpol() for NUMA policy without vma mm: add page_rmappable_folio() wrapper mempolicy: remove confusing MPOL_MF_LAZY dead code mempolicy: mpol_shared_policy_init() without pseudo-vma mempolicy trivia: use pgoff_t in shared mempolicy tree mempolicy trivia: slightly more consistent naming mempolicy trivia: delete those ancient pr_debug()s mempolicy: fix migrate_pages(2) syscall return nr_failed kernfs: drop shared NUMA mempolicy hooks hugetlbfs: drop shared NUMA mempolicy pretence mm/damon/sysfs-test: add a unit test for damon_sysfs_set_targets() ...
2485 lines
62 KiB
C
2485 lines
62 KiB
C
// SPDX-License-Identifier: GPL-2.0-or-later
|
||
/*
|
||
* Copyright (C) 2002, 2004 Oracle. All rights reserved.
|
||
*/
|
||
|
||
#include <linux/fs.h>
|
||
#include <linux/slab.h>
|
||
#include <linux/highmem.h>
|
||
#include <linux/pagemap.h>
|
||
#include <asm/byteorder.h>
|
||
#include <linux/swap.h>
|
||
#include <linux/mpage.h>
|
||
#include <linux/quotaops.h>
|
||
#include <linux/blkdev.h>
|
||
#include <linux/uio.h>
|
||
#include <linux/mm.h>
|
||
|
||
#include <cluster/masklog.h>
|
||
|
||
#include "ocfs2.h"
|
||
|
||
#include "alloc.h"
|
||
#include "aops.h"
|
||
#include "dlmglue.h"
|
||
#include "extent_map.h"
|
||
#include "file.h"
|
||
#include "inode.h"
|
||
#include "journal.h"
|
||
#include "suballoc.h"
|
||
#include "super.h"
|
||
#include "symlink.h"
|
||
#include "refcounttree.h"
|
||
#include "ocfs2_trace.h"
|
||
|
||
#include "buffer_head_io.h"
|
||
#include "dir.h"
|
||
#include "namei.h"
|
||
#include "sysfile.h"
|
||
|
||
static int ocfs2_symlink_get_block(struct inode *inode, sector_t iblock,
|
||
struct buffer_head *bh_result, int create)
|
||
{
|
||
int err = -EIO;
|
||
int status;
|
||
struct ocfs2_dinode *fe = NULL;
|
||
struct buffer_head *bh = NULL;
|
||
struct buffer_head *buffer_cache_bh = NULL;
|
||
struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
|
||
void *kaddr;
|
||
|
||
trace_ocfs2_symlink_get_block(
|
||
(unsigned long long)OCFS2_I(inode)->ip_blkno,
|
||
(unsigned long long)iblock, bh_result, create);
|
||
|
||
BUG_ON(ocfs2_inode_is_fast_symlink(inode));
|
||
|
||
if ((iblock << inode->i_sb->s_blocksize_bits) > PATH_MAX + 1) {
|
||
mlog(ML_ERROR, "block offset > PATH_MAX: %llu",
|
||
(unsigned long long)iblock);
|
||
goto bail;
|
||
}
|
||
|
||
status = ocfs2_read_inode_block(inode, &bh);
|
||
if (status < 0) {
|
||
mlog_errno(status);
|
||
goto bail;
|
||
}
|
||
fe = (struct ocfs2_dinode *) bh->b_data;
|
||
|
||
if ((u64)iblock >= ocfs2_clusters_to_blocks(inode->i_sb,
|
||
le32_to_cpu(fe->i_clusters))) {
|
||
err = -ENOMEM;
|
||
mlog(ML_ERROR, "block offset is outside the allocated size: "
|
||
"%llu\n", (unsigned long long)iblock);
|
||
goto bail;
|
||
}
|
||
|
||
/* We don't use the page cache to create symlink data, so if
|
||
* need be, copy it over from the buffer cache. */
|
||
if (!buffer_uptodate(bh_result) && ocfs2_inode_is_new(inode)) {
|
||
u64 blkno = le64_to_cpu(fe->id2.i_list.l_recs[0].e_blkno) +
|
||
iblock;
|
||
buffer_cache_bh = sb_getblk(osb->sb, blkno);
|
||
if (!buffer_cache_bh) {
|
||
err = -ENOMEM;
|
||
mlog(ML_ERROR, "couldn't getblock for symlink!\n");
|
||
goto bail;
|
||
}
|
||
|
||
/* we haven't locked out transactions, so a commit
|
||
* could've happened. Since we've got a reference on
|
||
* the bh, even if it commits while we're doing the
|
||
* copy, the data is still good. */
|
||
if (buffer_jbd(buffer_cache_bh)
|
||
&& ocfs2_inode_is_new(inode)) {
|
||
kaddr = kmap_atomic(bh_result->b_page);
|
||
if (!kaddr) {
|
||
mlog(ML_ERROR, "couldn't kmap!\n");
|
||
goto bail;
|
||
}
|
||
memcpy(kaddr + (bh_result->b_size * iblock),
|
||
buffer_cache_bh->b_data,
|
||
bh_result->b_size);
|
||
kunmap_atomic(kaddr);
|
||
set_buffer_uptodate(bh_result);
|
||
}
|
||
brelse(buffer_cache_bh);
|
||
}
|
||
|
||
map_bh(bh_result, inode->i_sb,
|
||
le64_to_cpu(fe->id2.i_list.l_recs[0].e_blkno) + iblock);
|
||
|
||
err = 0;
|
||
|
||
bail:
|
||
brelse(bh);
|
||
|
||
return err;
|
||
}
|
||
|
||
static int ocfs2_lock_get_block(struct inode *inode, sector_t iblock,
|
||
struct buffer_head *bh_result, int create)
|
||
{
|
||
int ret = 0;
|
||
struct ocfs2_inode_info *oi = OCFS2_I(inode);
|
||
|
||
down_read(&oi->ip_alloc_sem);
|
||
ret = ocfs2_get_block(inode, iblock, bh_result, create);
|
||
up_read(&oi->ip_alloc_sem);
|
||
|
||
return ret;
|
||
}
|
||
|
||
int ocfs2_get_block(struct inode *inode, sector_t iblock,
|
||
struct buffer_head *bh_result, int create)
|
||
{
|
||
int err = 0;
|
||
unsigned int ext_flags;
|
||
u64 max_blocks = bh_result->b_size >> inode->i_blkbits;
|
||
u64 p_blkno, count, past_eof;
|
||
struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
|
||
|
||
trace_ocfs2_get_block((unsigned long long)OCFS2_I(inode)->ip_blkno,
|
||
(unsigned long long)iblock, bh_result, create);
|
||
|
||
if (OCFS2_I(inode)->ip_flags & OCFS2_INODE_SYSTEM_FILE)
|
||
mlog(ML_NOTICE, "get_block on system inode 0x%p (%lu)\n",
|
||
inode, inode->i_ino);
|
||
|
||
if (S_ISLNK(inode->i_mode)) {
|
||
/* this always does I/O for some reason. */
|
||
err = ocfs2_symlink_get_block(inode, iblock, bh_result, create);
|
||
goto bail;
|
||
}
|
||
|
||
err = ocfs2_extent_map_get_blocks(inode, iblock, &p_blkno, &count,
|
||
&ext_flags);
|
||
if (err) {
|
||
mlog(ML_ERROR, "Error %d from get_blocks(0x%p, %llu, 1, "
|
||
"%llu, NULL)\n", err, inode, (unsigned long long)iblock,
|
||
(unsigned long long)p_blkno);
|
||
goto bail;
|
||
}
|
||
|
||
if (max_blocks < count)
|
||
count = max_blocks;
|
||
|
||
/*
|
||
* ocfs2 never allocates in this function - the only time we
|
||
* need to use BH_New is when we're extending i_size on a file
|
||
* system which doesn't support holes, in which case BH_New
|
||
* allows __block_write_begin() to zero.
|
||
*
|
||
* If we see this on a sparse file system, then a truncate has
|
||
* raced us and removed the cluster. In this case, we clear
|
||
* the buffers dirty and uptodate bits and let the buffer code
|
||
* ignore it as a hole.
|
||
*/
|
||
if (create && p_blkno == 0 && ocfs2_sparse_alloc(osb)) {
|
||
clear_buffer_dirty(bh_result);
|
||
clear_buffer_uptodate(bh_result);
|
||
goto bail;
|
||
}
|
||
|
||
/* Treat the unwritten extent as a hole for zeroing purposes. */
|
||
if (p_blkno && !(ext_flags & OCFS2_EXT_UNWRITTEN))
|
||
map_bh(bh_result, inode->i_sb, p_blkno);
|
||
|
||
bh_result->b_size = count << inode->i_blkbits;
|
||
|
||
if (!ocfs2_sparse_alloc(osb)) {
|
||
if (p_blkno == 0) {
|
||
err = -EIO;
|
||
mlog(ML_ERROR,
|
||
"iblock = %llu p_blkno = %llu blkno=(%llu)\n",
|
||
(unsigned long long)iblock,
|
||
(unsigned long long)p_blkno,
|
||
(unsigned long long)OCFS2_I(inode)->ip_blkno);
|
||
mlog(ML_ERROR, "Size %llu, clusters %u\n", (unsigned long long)i_size_read(inode), OCFS2_I(inode)->ip_clusters);
|
||
dump_stack();
|
||
goto bail;
|
||
}
|
||
}
|
||
|
||
past_eof = ocfs2_blocks_for_bytes(inode->i_sb, i_size_read(inode));
|
||
|
||
trace_ocfs2_get_block_end((unsigned long long)OCFS2_I(inode)->ip_blkno,
|
||
(unsigned long long)past_eof);
|
||
if (create && (iblock >= past_eof))
|
||
set_buffer_new(bh_result);
|
||
|
||
bail:
|
||
if (err < 0)
|
||
err = -EIO;
|
||
|
||
return err;
|
||
}
|
||
|
||
int ocfs2_read_inline_data(struct inode *inode, struct page *page,
|
||
struct buffer_head *di_bh)
|
||
{
|
||
void *kaddr;
|
||
loff_t size;
|
||
struct ocfs2_dinode *di = (struct ocfs2_dinode *)di_bh->b_data;
|
||
|
||
if (!(le16_to_cpu(di->i_dyn_features) & OCFS2_INLINE_DATA_FL)) {
|
||
ocfs2_error(inode->i_sb, "Inode %llu lost inline data flag\n",
|
||
(unsigned long long)OCFS2_I(inode)->ip_blkno);
|
||
return -EROFS;
|
||
}
|
||
|
||
size = i_size_read(inode);
|
||
|
||
if (size > PAGE_SIZE ||
|
||
size > ocfs2_max_inline_data_with_xattr(inode->i_sb, di)) {
|
||
ocfs2_error(inode->i_sb,
|
||
"Inode %llu has with inline data has bad size: %Lu\n",
|
||
(unsigned long long)OCFS2_I(inode)->ip_blkno,
|
||
(unsigned long long)size);
|
||
return -EROFS;
|
||
}
|
||
|
||
kaddr = kmap_atomic(page);
|
||
if (size)
|
||
memcpy(kaddr, di->id2.i_data.id_data, size);
|
||
/* Clear the remaining part of the page */
|
||
memset(kaddr + size, 0, PAGE_SIZE - size);
|
||
flush_dcache_page(page);
|
||
kunmap_atomic(kaddr);
|
||
|
||
SetPageUptodate(page);
|
||
|
||
return 0;
|
||
}
|
||
|
||
static int ocfs2_readpage_inline(struct inode *inode, struct page *page)
|
||
{
|
||
int ret;
|
||
struct buffer_head *di_bh = NULL;
|
||
|
||
BUG_ON(!PageLocked(page));
|
||
BUG_ON(!(OCFS2_I(inode)->ip_dyn_features & OCFS2_INLINE_DATA_FL));
|
||
|
||
ret = ocfs2_read_inode_block(inode, &di_bh);
|
||
if (ret) {
|
||
mlog_errno(ret);
|
||
goto out;
|
||
}
|
||
|
||
ret = ocfs2_read_inline_data(inode, page, di_bh);
|
||
out:
|
||
unlock_page(page);
|
||
|
||
brelse(di_bh);
|
||
return ret;
|
||
}
|
||
|
||
static int ocfs2_read_folio(struct file *file, struct folio *folio)
|
||
{
|
||
struct inode *inode = folio->mapping->host;
|
||
struct ocfs2_inode_info *oi = OCFS2_I(inode);
|
||
loff_t start = folio_pos(folio);
|
||
int ret, unlock = 1;
|
||
|
||
trace_ocfs2_readpage((unsigned long long)oi->ip_blkno, folio->index);
|
||
|
||
ret = ocfs2_inode_lock_with_page(inode, NULL, 0, &folio->page);
|
||
if (ret != 0) {
|
||
if (ret == AOP_TRUNCATED_PAGE)
|
||
unlock = 0;
|
||
mlog_errno(ret);
|
||
goto out;
|
||
}
|
||
|
||
if (down_read_trylock(&oi->ip_alloc_sem) == 0) {
|
||
/*
|
||
* Unlock the folio and cycle ip_alloc_sem so that we don't
|
||
* busyloop waiting for ip_alloc_sem to unlock
|
||
*/
|
||
ret = AOP_TRUNCATED_PAGE;
|
||
folio_unlock(folio);
|
||
unlock = 0;
|
||
down_read(&oi->ip_alloc_sem);
|
||
up_read(&oi->ip_alloc_sem);
|
||
goto out_inode_unlock;
|
||
}
|
||
|
||
/*
|
||
* i_size might have just been updated as we grabed the meta lock. We
|
||
* might now be discovering a truncate that hit on another node.
|
||
* block_read_full_folio->get_block freaks out if it is asked to read
|
||
* beyond the end of a file, so we check here. Callers
|
||
* (generic_file_read, vm_ops->fault) are clever enough to check i_size
|
||
* and notice that the folio they just read isn't needed.
|
||
*
|
||
* XXX sys_readahead() seems to get that wrong?
|
||
*/
|
||
if (start >= i_size_read(inode)) {
|
||
folio_zero_segment(folio, 0, folio_size(folio));
|
||
folio_mark_uptodate(folio);
|
||
ret = 0;
|
||
goto out_alloc;
|
||
}
|
||
|
||
if (oi->ip_dyn_features & OCFS2_INLINE_DATA_FL)
|
||
ret = ocfs2_readpage_inline(inode, &folio->page);
|
||
else
|
||
ret = block_read_full_folio(folio, ocfs2_get_block);
|
||
unlock = 0;
|
||
|
||
out_alloc:
|
||
up_read(&oi->ip_alloc_sem);
|
||
out_inode_unlock:
|
||
ocfs2_inode_unlock(inode, 0);
|
||
out:
|
||
if (unlock)
|
||
folio_unlock(folio);
|
||
return ret;
|
||
}
|
||
|
||
/*
|
||
* This is used only for read-ahead. Failures or difficult to handle
|
||
* situations are safe to ignore.
|
||
*
|
||
* Right now, we don't bother with BH_Boundary - in-inode extent lists
|
||
* are quite large (243 extents on 4k blocks), so most inodes don't
|
||
* grow out to a tree. If need be, detecting boundary extents could
|
||
* trivially be added in a future version of ocfs2_get_block().
|
||
*/
|
||
static void ocfs2_readahead(struct readahead_control *rac)
|
||
{
|
||
int ret;
|
||
struct inode *inode = rac->mapping->host;
|
||
struct ocfs2_inode_info *oi = OCFS2_I(inode);
|
||
|
||
/*
|
||
* Use the nonblocking flag for the dlm code to avoid page
|
||
* lock inversion, but don't bother with retrying.
|
||
*/
|
||
ret = ocfs2_inode_lock_full(inode, NULL, 0, OCFS2_LOCK_NONBLOCK);
|
||
if (ret)
|
||
return;
|
||
|
||
if (down_read_trylock(&oi->ip_alloc_sem) == 0)
|
||
goto out_unlock;
|
||
|
||
/*
|
||
* Don't bother with inline-data. There isn't anything
|
||
* to read-ahead in that case anyway...
|
||
*/
|
||
if (oi->ip_dyn_features & OCFS2_INLINE_DATA_FL)
|
||
goto out_up;
|
||
|
||
/*
|
||
* Check whether a remote node truncated this file - we just
|
||
* drop out in that case as it's not worth handling here.
|
||
*/
|
||
if (readahead_pos(rac) >= i_size_read(inode))
|
||
goto out_up;
|
||
|
||
mpage_readahead(rac, ocfs2_get_block);
|
||
|
||
out_up:
|
||
up_read(&oi->ip_alloc_sem);
|
||
out_unlock:
|
||
ocfs2_inode_unlock(inode, 0);
|
||
}
|
||
|
||
/* Note: Because we don't support holes, our allocation has
|
||
* already happened (allocation writes zeros to the file data)
|
||
* so we don't have to worry about ordered writes in
|
||
* ocfs2_writepage.
|
||
*
|
||
* ->writepage is called during the process of invalidating the page cache
|
||
* during blocked lock processing. It can't block on any cluster locks
|
||
* to during block mapping. It's relying on the fact that the block
|
||
* mapping can't have disappeared under the dirty pages that it is
|
||
* being asked to write back.
|
||
*/
|
||
static int ocfs2_writepage(struct page *page, struct writeback_control *wbc)
|
||
{
|
||
trace_ocfs2_writepage(
|
||
(unsigned long long)OCFS2_I(page->mapping->host)->ip_blkno,
|
||
page->index);
|
||
|
||
return block_write_full_page(page, ocfs2_get_block, wbc);
|
||
}
|
||
|
||
/* Taken from ext3. We don't necessarily need the full blown
|
||
* functionality yet, but IMHO it's better to cut and paste the whole
|
||
* thing so we can avoid introducing our own bugs (and easily pick up
|
||
* their fixes when they happen) --Mark */
|
||
int walk_page_buffers( handle_t *handle,
|
||
struct buffer_head *head,
|
||
unsigned from,
|
||
unsigned to,
|
||
int *partial,
|
||
int (*fn)( handle_t *handle,
|
||
struct buffer_head *bh))
|
||
{
|
||
struct buffer_head *bh;
|
||
unsigned block_start, block_end;
|
||
unsigned blocksize = head->b_size;
|
||
int err, ret = 0;
|
||
struct buffer_head *next;
|
||
|
||
for ( bh = head, block_start = 0;
|
||
ret == 0 && (bh != head || !block_start);
|
||
block_start = block_end, bh = next)
|
||
{
|
||
next = bh->b_this_page;
|
||
block_end = block_start + blocksize;
|
||
if (block_end <= from || block_start >= to) {
|
||
if (partial && !buffer_uptodate(bh))
|
||
*partial = 1;
|
||
continue;
|
||
}
|
||
err = (*fn)(handle, bh);
|
||
if (!ret)
|
||
ret = err;
|
||
}
|
||
return ret;
|
||
}
|
||
|
||
static sector_t ocfs2_bmap(struct address_space *mapping, sector_t block)
|
||
{
|
||
sector_t status;
|
||
u64 p_blkno = 0;
|
||
int err = 0;
|
||
struct inode *inode = mapping->host;
|
||
|
||
trace_ocfs2_bmap((unsigned long long)OCFS2_I(inode)->ip_blkno,
|
||
(unsigned long long)block);
|
||
|
||
/*
|
||
* The swap code (ab-)uses ->bmap to get a block mapping and then
|
||
* bypasseѕ the file system for actual I/O. We really can't allow
|
||
* that on refcounted inodes, so we have to skip out here. And yes,
|
||
* 0 is the magic code for a bmap error..
|
||
*/
|
||
if (ocfs2_is_refcount_inode(inode))
|
||
return 0;
|
||
|
||
/* We don't need to lock journal system files, since they aren't
|
||
* accessed concurrently from multiple nodes.
|
||
*/
|
||
if (!INODE_JOURNAL(inode)) {
|
||
err = ocfs2_inode_lock(inode, NULL, 0);
|
||
if (err) {
|
||
if (err != -ENOENT)
|
||
mlog_errno(err);
|
||
goto bail;
|
||
}
|
||
down_read(&OCFS2_I(inode)->ip_alloc_sem);
|
||
}
|
||
|
||
if (!(OCFS2_I(inode)->ip_dyn_features & OCFS2_INLINE_DATA_FL))
|
||
err = ocfs2_extent_map_get_blocks(inode, block, &p_blkno, NULL,
|
||
NULL);
|
||
|
||
if (!INODE_JOURNAL(inode)) {
|
||
up_read(&OCFS2_I(inode)->ip_alloc_sem);
|
||
ocfs2_inode_unlock(inode, 0);
|
||
}
|
||
|
||
if (err) {
|
||
mlog(ML_ERROR, "get_blocks() failed, block = %llu\n",
|
||
(unsigned long long)block);
|
||
mlog_errno(err);
|
||
goto bail;
|
||
}
|
||
|
||
bail:
|
||
status = err ? 0 : p_blkno;
|
||
|
||
return status;
|
||
}
|
||
|
||
static bool ocfs2_release_folio(struct folio *folio, gfp_t wait)
|
||
{
|
||
if (!folio_buffers(folio))
|
||
return false;
|
||
return try_to_free_buffers(folio);
|
||
}
|
||
|
||
static void ocfs2_figure_cluster_boundaries(struct ocfs2_super *osb,
|
||
u32 cpos,
|
||
unsigned int *start,
|
||
unsigned int *end)
|
||
{
|
||
unsigned int cluster_start = 0, cluster_end = PAGE_SIZE;
|
||
|
||
if (unlikely(PAGE_SHIFT > osb->s_clustersize_bits)) {
|
||
unsigned int cpp;
|
||
|
||
cpp = 1 << (PAGE_SHIFT - osb->s_clustersize_bits);
|
||
|
||
cluster_start = cpos % cpp;
|
||
cluster_start = cluster_start << osb->s_clustersize_bits;
|
||
|
||
cluster_end = cluster_start + osb->s_clustersize;
|
||
}
|
||
|
||
BUG_ON(cluster_start > PAGE_SIZE);
|
||
BUG_ON(cluster_end > PAGE_SIZE);
|
||
|
||
if (start)
|
||
*start = cluster_start;
|
||
if (end)
|
||
*end = cluster_end;
|
||
}
|
||
|
||
/*
|
||
* 'from' and 'to' are the region in the page to avoid zeroing.
|
||
*
|
||
* If pagesize > clustersize, this function will avoid zeroing outside
|
||
* of the cluster boundary.
|
||
*
|
||
* from == to == 0 is code for "zero the entire cluster region"
|
||
*/
|
||
static void ocfs2_clear_page_regions(struct page *page,
|
||
struct ocfs2_super *osb, u32 cpos,
|
||
unsigned from, unsigned to)
|
||
{
|
||
void *kaddr;
|
||
unsigned int cluster_start, cluster_end;
|
||
|
||
ocfs2_figure_cluster_boundaries(osb, cpos, &cluster_start, &cluster_end);
|
||
|
||
kaddr = kmap_atomic(page);
|
||
|
||
if (from || to) {
|
||
if (from > cluster_start)
|
||
memset(kaddr + cluster_start, 0, from - cluster_start);
|
||
if (to < cluster_end)
|
||
memset(kaddr + to, 0, cluster_end - to);
|
||
} else {
|
||
memset(kaddr + cluster_start, 0, cluster_end - cluster_start);
|
||
}
|
||
|
||
kunmap_atomic(kaddr);
|
||
}
|
||
|
||
/*
|
||
* Nonsparse file systems fully allocate before we get to the write
|
||
* code. This prevents ocfs2_write() from tagging the write as an
|
||
* allocating one, which means ocfs2_map_page_blocks() might try to
|
||
* read-in the blocks at the tail of our file. Avoid reading them by
|
||
* testing i_size against each block offset.
|
||
*/
|
||
static int ocfs2_should_read_blk(struct inode *inode, struct folio *folio,
|
||
unsigned int block_start)
|
||
{
|
||
u64 offset = folio_pos(folio) + block_start;
|
||
|
||
if (ocfs2_sparse_alloc(OCFS2_SB(inode->i_sb)))
|
||
return 1;
|
||
|
||
if (i_size_read(inode) > offset)
|
||
return 1;
|
||
|
||
return 0;
|
||
}
|
||
|
||
/*
|
||
* Some of this taken from __block_write_begin(). We already have our
|
||
* mapping by now though, and the entire write will be allocating or
|
||
* it won't, so not much need to use BH_New.
|
||
*
|
||
* This will also skip zeroing, which is handled externally.
|
||
*/
|
||
int ocfs2_map_page_blocks(struct page *page, u64 *p_blkno,
|
||
struct inode *inode, unsigned int from,
|
||
unsigned int to, int new)
|
||
{
|
||
struct folio *folio = page_folio(page);
|
||
int ret = 0;
|
||
struct buffer_head *head, *bh, *wait[2], **wait_bh = wait;
|
||
unsigned int block_end, block_start;
|
||
unsigned int bsize = i_blocksize(inode);
|
||
|
||
head = folio_buffers(folio);
|
||
if (!head)
|
||
head = create_empty_buffers(folio, bsize, 0);
|
||
|
||
for (bh = head, block_start = 0; bh != head || !block_start;
|
||
bh = bh->b_this_page, block_start += bsize) {
|
||
block_end = block_start + bsize;
|
||
|
||
clear_buffer_new(bh);
|
||
|
||
/*
|
||
* Ignore blocks outside of our i/o range -
|
||
* they may belong to unallocated clusters.
|
||
*/
|
||
if (block_start >= to || block_end <= from) {
|
||
if (folio_test_uptodate(folio))
|
||
set_buffer_uptodate(bh);
|
||
continue;
|
||
}
|
||
|
||
/*
|
||
* For an allocating write with cluster size >= page
|
||
* size, we always write the entire page.
|
||
*/
|
||
if (new)
|
||
set_buffer_new(bh);
|
||
|
||
if (!buffer_mapped(bh)) {
|
||
map_bh(bh, inode->i_sb, *p_blkno);
|
||
clean_bdev_bh_alias(bh);
|
||
}
|
||
|
||
if (folio_test_uptodate(folio)) {
|
||
set_buffer_uptodate(bh);
|
||
} else if (!buffer_uptodate(bh) && !buffer_delay(bh) &&
|
||
!buffer_new(bh) &&
|
||
ocfs2_should_read_blk(inode, folio, block_start) &&
|
||
(block_start < from || block_end > to)) {
|
||
bh_read_nowait(bh, 0);
|
||
*wait_bh++=bh;
|
||
}
|
||
|
||
*p_blkno = *p_blkno + 1;
|
||
}
|
||
|
||
/*
|
||
* If we issued read requests - let them complete.
|
||
*/
|
||
while(wait_bh > wait) {
|
||
wait_on_buffer(*--wait_bh);
|
||
if (!buffer_uptodate(*wait_bh))
|
||
ret = -EIO;
|
||
}
|
||
|
||
if (ret == 0 || !new)
|
||
return ret;
|
||
|
||
/*
|
||
* If we get -EIO above, zero out any newly allocated blocks
|
||
* to avoid exposing stale data.
|
||
*/
|
||
bh = head;
|
||
block_start = 0;
|
||
do {
|
||
block_end = block_start + bsize;
|
||
if (block_end <= from)
|
||
goto next_bh;
|
||
if (block_start >= to)
|
||
break;
|
||
|
||
folio_zero_range(folio, block_start, bh->b_size);
|
||
set_buffer_uptodate(bh);
|
||
mark_buffer_dirty(bh);
|
||
|
||
next_bh:
|
||
block_start = block_end;
|
||
bh = bh->b_this_page;
|
||
} while (bh != head);
|
||
|
||
return ret;
|
||
}
|
||
|
||
#if (PAGE_SIZE >= OCFS2_MAX_CLUSTERSIZE)
|
||
#define OCFS2_MAX_CTXT_PAGES 1
|
||
#else
|
||
#define OCFS2_MAX_CTXT_PAGES (OCFS2_MAX_CLUSTERSIZE / PAGE_SIZE)
|
||
#endif
|
||
|
||
#define OCFS2_MAX_CLUSTERS_PER_PAGE (PAGE_SIZE / OCFS2_MIN_CLUSTERSIZE)
|
||
|
||
struct ocfs2_unwritten_extent {
|
||
struct list_head ue_node;
|
||
struct list_head ue_ip_node;
|
||
u32 ue_cpos;
|
||
u32 ue_phys;
|
||
};
|
||
|
||
/*
|
||
* Describe the state of a single cluster to be written to.
|
||
*/
|
||
struct ocfs2_write_cluster_desc {
|
||
u32 c_cpos;
|
||
u32 c_phys;
|
||
/*
|
||
* Give this a unique field because c_phys eventually gets
|
||
* filled.
|
||
*/
|
||
unsigned c_new;
|
||
unsigned c_clear_unwritten;
|
||
unsigned c_needs_zero;
|
||
};
|
||
|
||
struct ocfs2_write_ctxt {
|
||
/* Logical cluster position / len of write */
|
||
u32 w_cpos;
|
||
u32 w_clen;
|
||
|
||
/* First cluster allocated in a nonsparse extend */
|
||
u32 w_first_new_cpos;
|
||
|
||
/* Type of caller. Must be one of buffer, mmap, direct. */
|
||
ocfs2_write_type_t w_type;
|
||
|
||
struct ocfs2_write_cluster_desc w_desc[OCFS2_MAX_CLUSTERS_PER_PAGE];
|
||
|
||
/*
|
||
* This is true if page_size > cluster_size.
|
||
*
|
||
* It triggers a set of special cases during write which might
|
||
* have to deal with allocating writes to partial pages.
|
||
*/
|
||
unsigned int w_large_pages;
|
||
|
||
/*
|
||
* Pages involved in this write.
|
||
*
|
||
* w_target_page is the page being written to by the user.
|
||
*
|
||
* w_pages is an array of pages which always contains
|
||
* w_target_page, and in the case of an allocating write with
|
||
* page_size < cluster size, it will contain zero'd and mapped
|
||
* pages adjacent to w_target_page which need to be written
|
||
* out in so that future reads from that region will get
|
||
* zero's.
|
||
*/
|
||
unsigned int w_num_pages;
|
||
struct page *w_pages[OCFS2_MAX_CTXT_PAGES];
|
||
struct page *w_target_page;
|
||
|
||
/*
|
||
* w_target_locked is used for page_mkwrite path indicating no unlocking
|
||
* against w_target_page in ocfs2_write_end_nolock.
|
||
*/
|
||
unsigned int w_target_locked:1;
|
||
|
||
/*
|
||
* ocfs2_write_end() uses this to know what the real range to
|
||
* write in the target should be.
|
||
*/
|
||
unsigned int w_target_from;
|
||
unsigned int w_target_to;
|
||
|
||
/*
|
||
* We could use journal_current_handle() but this is cleaner,
|
||
* IMHO -Mark
|
||
*/
|
||
handle_t *w_handle;
|
||
|
||
struct buffer_head *w_di_bh;
|
||
|
||
struct ocfs2_cached_dealloc_ctxt w_dealloc;
|
||
|
||
struct list_head w_unwritten_list;
|
||
unsigned int w_unwritten_count;
|
||
};
|
||
|
||
void ocfs2_unlock_and_free_pages(struct page **pages, int num_pages)
|
||
{
|
||
int i;
|
||
|
||
for(i = 0; i < num_pages; i++) {
|
||
if (pages[i]) {
|
||
unlock_page(pages[i]);
|
||
mark_page_accessed(pages[i]);
|
||
put_page(pages[i]);
|
||
}
|
||
}
|
||
}
|
||
|
||
static void ocfs2_unlock_pages(struct ocfs2_write_ctxt *wc)
|
||
{
|
||
int i;
|
||
|
||
/*
|
||
* w_target_locked is only set to true in the page_mkwrite() case.
|
||
* The intent is to allow us to lock the target page from write_begin()
|
||
* to write_end(). The caller must hold a ref on w_target_page.
|
||
*/
|
||
if (wc->w_target_locked) {
|
||
BUG_ON(!wc->w_target_page);
|
||
for (i = 0; i < wc->w_num_pages; i++) {
|
||
if (wc->w_target_page == wc->w_pages[i]) {
|
||
wc->w_pages[i] = NULL;
|
||
break;
|
||
}
|
||
}
|
||
mark_page_accessed(wc->w_target_page);
|
||
put_page(wc->w_target_page);
|
||
}
|
||
ocfs2_unlock_and_free_pages(wc->w_pages, wc->w_num_pages);
|
||
}
|
||
|
||
static void ocfs2_free_unwritten_list(struct inode *inode,
|
||
struct list_head *head)
|
||
{
|
||
struct ocfs2_inode_info *oi = OCFS2_I(inode);
|
||
struct ocfs2_unwritten_extent *ue = NULL, *tmp = NULL;
|
||
|
||
list_for_each_entry_safe(ue, tmp, head, ue_node) {
|
||
list_del(&ue->ue_node);
|
||
spin_lock(&oi->ip_lock);
|
||
list_del(&ue->ue_ip_node);
|
||
spin_unlock(&oi->ip_lock);
|
||
kfree(ue);
|
||
}
|
||
}
|
||
|
||
static void ocfs2_free_write_ctxt(struct inode *inode,
|
||
struct ocfs2_write_ctxt *wc)
|
||
{
|
||
ocfs2_free_unwritten_list(inode, &wc->w_unwritten_list);
|
||
ocfs2_unlock_pages(wc);
|
||
brelse(wc->w_di_bh);
|
||
kfree(wc);
|
||
}
|
||
|
||
static int ocfs2_alloc_write_ctxt(struct ocfs2_write_ctxt **wcp,
|
||
struct ocfs2_super *osb, loff_t pos,
|
||
unsigned len, ocfs2_write_type_t type,
|
||
struct buffer_head *di_bh)
|
||
{
|
||
u32 cend;
|
||
struct ocfs2_write_ctxt *wc;
|
||
|
||
wc = kzalloc(sizeof(struct ocfs2_write_ctxt), GFP_NOFS);
|
||
if (!wc)
|
||
return -ENOMEM;
|
||
|
||
wc->w_cpos = pos >> osb->s_clustersize_bits;
|
||
wc->w_first_new_cpos = UINT_MAX;
|
||
cend = (pos + len - 1) >> osb->s_clustersize_bits;
|
||
wc->w_clen = cend - wc->w_cpos + 1;
|
||
get_bh(di_bh);
|
||
wc->w_di_bh = di_bh;
|
||
wc->w_type = type;
|
||
|
||
if (unlikely(PAGE_SHIFT > osb->s_clustersize_bits))
|
||
wc->w_large_pages = 1;
|
||
else
|
||
wc->w_large_pages = 0;
|
||
|
||
ocfs2_init_dealloc_ctxt(&wc->w_dealloc);
|
||
INIT_LIST_HEAD(&wc->w_unwritten_list);
|
||
|
||
*wcp = wc;
|
||
|
||
return 0;
|
||
}
|
||
|
||
/*
|
||
* If a page has any new buffers, zero them out here, and mark them uptodate
|
||
* and dirty so they'll be written out (in order to prevent uninitialised
|
||
* block data from leaking). And clear the new bit.
|
||
*/
|
||
static void ocfs2_zero_new_buffers(struct page *page, unsigned from, unsigned to)
|
||
{
|
||
unsigned int block_start, block_end;
|
||
struct buffer_head *head, *bh;
|
||
|
||
BUG_ON(!PageLocked(page));
|
||
if (!page_has_buffers(page))
|
||
return;
|
||
|
||
bh = head = page_buffers(page);
|
||
block_start = 0;
|
||
do {
|
||
block_end = block_start + bh->b_size;
|
||
|
||
if (buffer_new(bh)) {
|
||
if (block_end > from && block_start < to) {
|
||
if (!PageUptodate(page)) {
|
||
unsigned start, end;
|
||
|
||
start = max(from, block_start);
|
||
end = min(to, block_end);
|
||
|
||
zero_user_segment(page, start, end);
|
||
set_buffer_uptodate(bh);
|
||
}
|
||
|
||
clear_buffer_new(bh);
|
||
mark_buffer_dirty(bh);
|
||
}
|
||
}
|
||
|
||
block_start = block_end;
|
||
bh = bh->b_this_page;
|
||
} while (bh != head);
|
||
}
|
||
|
||
/*
|
||
* Only called when we have a failure during allocating write to write
|
||
* zero's to the newly allocated region.
|
||
*/
|
||
static void ocfs2_write_failure(struct inode *inode,
|
||
struct ocfs2_write_ctxt *wc,
|
||
loff_t user_pos, unsigned user_len)
|
||
{
|
||
int i;
|
||
unsigned from = user_pos & (PAGE_SIZE - 1),
|
||
to = user_pos + user_len;
|
||
struct page *tmppage;
|
||
|
||
if (wc->w_target_page)
|
||
ocfs2_zero_new_buffers(wc->w_target_page, from, to);
|
||
|
||
for(i = 0; i < wc->w_num_pages; i++) {
|
||
tmppage = wc->w_pages[i];
|
||
|
||
if (tmppage && page_has_buffers(tmppage)) {
|
||
if (ocfs2_should_order_data(inode))
|
||
ocfs2_jbd2_inode_add_write(wc->w_handle, inode,
|
||
user_pos, user_len);
|
||
|
||
block_commit_write(tmppage, from, to);
|
||
}
|
||
}
|
||
}
|
||
|
||
static int ocfs2_prepare_page_for_write(struct inode *inode, u64 *p_blkno,
|
||
struct ocfs2_write_ctxt *wc,
|
||
struct page *page, u32 cpos,
|
||
loff_t user_pos, unsigned user_len,
|
||
int new)
|
||
{
|
||
int ret;
|
||
unsigned int map_from = 0, map_to = 0;
|
||
unsigned int cluster_start, cluster_end;
|
||
unsigned int user_data_from = 0, user_data_to = 0;
|
||
|
||
ocfs2_figure_cluster_boundaries(OCFS2_SB(inode->i_sb), cpos,
|
||
&cluster_start, &cluster_end);
|
||
|
||
/* treat the write as new if the a hole/lseek spanned across
|
||
* the page boundary.
|
||
*/
|
||
new = new | ((i_size_read(inode) <= page_offset(page)) &&
|
||
(page_offset(page) <= user_pos));
|
||
|
||
if (page == wc->w_target_page) {
|
||
map_from = user_pos & (PAGE_SIZE - 1);
|
||
map_to = map_from + user_len;
|
||
|
||
if (new)
|
||
ret = ocfs2_map_page_blocks(page, p_blkno, inode,
|
||
cluster_start, cluster_end,
|
||
new);
|
||
else
|
||
ret = ocfs2_map_page_blocks(page, p_blkno, inode,
|
||
map_from, map_to, new);
|
||
if (ret) {
|
||
mlog_errno(ret);
|
||
goto out;
|
||
}
|
||
|
||
user_data_from = map_from;
|
||
user_data_to = map_to;
|
||
if (new) {
|
||
map_from = cluster_start;
|
||
map_to = cluster_end;
|
||
}
|
||
} else {
|
||
/*
|
||
* If we haven't allocated the new page yet, we
|
||
* shouldn't be writing it out without copying user
|
||
* data. This is likely a math error from the caller.
|
||
*/
|
||
BUG_ON(!new);
|
||
|
||
map_from = cluster_start;
|
||
map_to = cluster_end;
|
||
|
||
ret = ocfs2_map_page_blocks(page, p_blkno, inode,
|
||
cluster_start, cluster_end, new);
|
||
if (ret) {
|
||
mlog_errno(ret);
|
||
goto out;
|
||
}
|
||
}
|
||
|
||
/*
|
||
* Parts of newly allocated pages need to be zero'd.
|
||
*
|
||
* Above, we have also rewritten 'to' and 'from' - as far as
|
||
* the rest of the function is concerned, the entire cluster
|
||
* range inside of a page needs to be written.
|
||
*
|
||
* We can skip this if the page is up to date - it's already
|
||
* been zero'd from being read in as a hole.
|
||
*/
|
||
if (new && !PageUptodate(page))
|
||
ocfs2_clear_page_regions(page, OCFS2_SB(inode->i_sb),
|
||
cpos, user_data_from, user_data_to);
|
||
|
||
flush_dcache_page(page);
|
||
|
||
out:
|
||
return ret;
|
||
}
|
||
|
||
/*
|
||
* This function will only grab one clusters worth of pages.
|
||
*/
|
||
static int ocfs2_grab_pages_for_write(struct address_space *mapping,
|
||
struct ocfs2_write_ctxt *wc,
|
||
u32 cpos, loff_t user_pos,
|
||
unsigned user_len, int new,
|
||
struct page *mmap_page)
|
||
{
|
||
int ret = 0, i;
|
||
unsigned long start, target_index, end_index, index;
|
||
struct inode *inode = mapping->host;
|
||
loff_t last_byte;
|
||
|
||
target_index = user_pos >> PAGE_SHIFT;
|
||
|
||
/*
|
||
* Figure out how many pages we'll be manipulating here. For
|
||
* non allocating write, we just change the one
|
||
* page. Otherwise, we'll need a whole clusters worth. If we're
|
||
* writing past i_size, we only need enough pages to cover the
|
||
* last page of the write.
|
||
*/
|
||
if (new) {
|
||
wc->w_num_pages = ocfs2_pages_per_cluster(inode->i_sb);
|
||
start = ocfs2_align_clusters_to_page_index(inode->i_sb, cpos);
|
||
/*
|
||
* We need the index *past* the last page we could possibly
|
||
* touch. This is the page past the end of the write or
|
||
* i_size, whichever is greater.
|
||
*/
|
||
last_byte = max(user_pos + user_len, i_size_read(inode));
|
||
BUG_ON(last_byte < 1);
|
||
end_index = ((last_byte - 1) >> PAGE_SHIFT) + 1;
|
||
if ((start + wc->w_num_pages) > end_index)
|
||
wc->w_num_pages = end_index - start;
|
||
} else {
|
||
wc->w_num_pages = 1;
|
||
start = target_index;
|
||
}
|
||
end_index = (user_pos + user_len - 1) >> PAGE_SHIFT;
|
||
|
||
for(i = 0; i < wc->w_num_pages; i++) {
|
||
index = start + i;
|
||
|
||
if (index >= target_index && index <= end_index &&
|
||
wc->w_type == OCFS2_WRITE_MMAP) {
|
||
/*
|
||
* ocfs2_pagemkwrite() is a little different
|
||
* and wants us to directly use the page
|
||
* passed in.
|
||
*/
|
||
lock_page(mmap_page);
|
||
|
||
/* Exit and let the caller retry */
|
||
if (mmap_page->mapping != mapping) {
|
||
WARN_ON(mmap_page->mapping);
|
||
unlock_page(mmap_page);
|
||
ret = -EAGAIN;
|
||
goto out;
|
||
}
|
||
|
||
get_page(mmap_page);
|
||
wc->w_pages[i] = mmap_page;
|
||
wc->w_target_locked = true;
|
||
} else if (index >= target_index && index <= end_index &&
|
||
wc->w_type == OCFS2_WRITE_DIRECT) {
|
||
/* Direct write has no mapping page. */
|
||
wc->w_pages[i] = NULL;
|
||
continue;
|
||
} else {
|
||
wc->w_pages[i] = find_or_create_page(mapping, index,
|
||
GFP_NOFS);
|
||
if (!wc->w_pages[i]) {
|
||
ret = -ENOMEM;
|
||
mlog_errno(ret);
|
||
goto out;
|
||
}
|
||
}
|
||
wait_for_stable_page(wc->w_pages[i]);
|
||
|
||
if (index == target_index)
|
||
wc->w_target_page = wc->w_pages[i];
|
||
}
|
||
out:
|
||
if (ret)
|
||
wc->w_target_locked = false;
|
||
return ret;
|
||
}
|
||
|
||
/*
|
||
* Prepare a single cluster for write one cluster into the file.
|
||
*/
|
||
static int ocfs2_write_cluster(struct address_space *mapping,
|
||
u32 *phys, unsigned int new,
|
||
unsigned int clear_unwritten,
|
||
unsigned int should_zero,
|
||
struct ocfs2_alloc_context *data_ac,
|
||
struct ocfs2_alloc_context *meta_ac,
|
||
struct ocfs2_write_ctxt *wc, u32 cpos,
|
||
loff_t user_pos, unsigned user_len)
|
||
{
|
||
int ret, i;
|
||
u64 p_blkno;
|
||
struct inode *inode = mapping->host;
|
||
struct ocfs2_extent_tree et;
|
||
int bpc = ocfs2_clusters_to_blocks(inode->i_sb, 1);
|
||
|
||
if (new) {
|
||
u32 tmp_pos;
|
||
|
||
/*
|
||
* This is safe to call with the page locks - it won't take
|
||
* any additional semaphores or cluster locks.
|
||
*/
|
||
tmp_pos = cpos;
|
||
ret = ocfs2_add_inode_data(OCFS2_SB(inode->i_sb), inode,
|
||
&tmp_pos, 1, !clear_unwritten,
|
||
wc->w_di_bh, wc->w_handle,
|
||
data_ac, meta_ac, NULL);
|
||
/*
|
||
* This shouldn't happen because we must have already
|
||
* calculated the correct meta data allocation required. The
|
||
* internal tree allocation code should know how to increase
|
||
* transaction credits itself.
|
||
*
|
||
* If need be, we could handle -EAGAIN for a
|
||
* RESTART_TRANS here.
|
||
*/
|
||
mlog_bug_on_msg(ret == -EAGAIN,
|
||
"Inode %llu: EAGAIN return during allocation.\n",
|
||
(unsigned long long)OCFS2_I(inode)->ip_blkno);
|
||
if (ret < 0) {
|
||
mlog_errno(ret);
|
||
goto out;
|
||
}
|
||
} else if (clear_unwritten) {
|
||
ocfs2_init_dinode_extent_tree(&et, INODE_CACHE(inode),
|
||
wc->w_di_bh);
|
||
ret = ocfs2_mark_extent_written(inode, &et,
|
||
wc->w_handle, cpos, 1, *phys,
|
||
meta_ac, &wc->w_dealloc);
|
||
if (ret < 0) {
|
||
mlog_errno(ret);
|
||
goto out;
|
||
}
|
||
}
|
||
|
||
/*
|
||
* The only reason this should fail is due to an inability to
|
||
* find the extent added.
|
||
*/
|
||
ret = ocfs2_get_clusters(inode, cpos, phys, NULL, NULL);
|
||
if (ret < 0) {
|
||
mlog(ML_ERROR, "Get physical blkno failed for inode %llu, "
|
||
"at logical cluster %u",
|
||
(unsigned long long)OCFS2_I(inode)->ip_blkno, cpos);
|
||
goto out;
|
||
}
|
||
|
||
BUG_ON(*phys == 0);
|
||
|
||
p_blkno = ocfs2_clusters_to_blocks(inode->i_sb, *phys);
|
||
if (!should_zero)
|
||
p_blkno += (user_pos >> inode->i_sb->s_blocksize_bits) & (u64)(bpc - 1);
|
||
|
||
for(i = 0; i < wc->w_num_pages; i++) {
|
||
int tmpret;
|
||
|
||
/* This is the direct io target page. */
|
||
if (wc->w_pages[i] == NULL) {
|
||
p_blkno++;
|
||
continue;
|
||
}
|
||
|
||
tmpret = ocfs2_prepare_page_for_write(inode, &p_blkno, wc,
|
||
wc->w_pages[i], cpos,
|
||
user_pos, user_len,
|
||
should_zero);
|
||
if (tmpret) {
|
||
mlog_errno(tmpret);
|
||
if (ret == 0)
|
||
ret = tmpret;
|
||
}
|
||
}
|
||
|
||
/*
|
||
* We only have cleanup to do in case of allocating write.
|
||
*/
|
||
if (ret && new)
|
||
ocfs2_write_failure(inode, wc, user_pos, user_len);
|
||
|
||
out:
|
||
|
||
return ret;
|
||
}
|
||
|
||
static int ocfs2_write_cluster_by_desc(struct address_space *mapping,
|
||
struct ocfs2_alloc_context *data_ac,
|
||
struct ocfs2_alloc_context *meta_ac,
|
||
struct ocfs2_write_ctxt *wc,
|
||
loff_t pos, unsigned len)
|
||
{
|
||
int ret, i;
|
||
loff_t cluster_off;
|
||
unsigned int local_len = len;
|
||
struct ocfs2_write_cluster_desc *desc;
|
||
struct ocfs2_super *osb = OCFS2_SB(mapping->host->i_sb);
|
||
|
||
for (i = 0; i < wc->w_clen; i++) {
|
||
desc = &wc->w_desc[i];
|
||
|
||
/*
|
||
* We have to make sure that the total write passed in
|
||
* doesn't extend past a single cluster.
|
||
*/
|
||
local_len = len;
|
||
cluster_off = pos & (osb->s_clustersize - 1);
|
||
if ((cluster_off + local_len) > osb->s_clustersize)
|
||
local_len = osb->s_clustersize - cluster_off;
|
||
|
||
ret = ocfs2_write_cluster(mapping, &desc->c_phys,
|
||
desc->c_new,
|
||
desc->c_clear_unwritten,
|
||
desc->c_needs_zero,
|
||
data_ac, meta_ac,
|
||
wc, desc->c_cpos, pos, local_len);
|
||
if (ret) {
|
||
mlog_errno(ret);
|
||
goto out;
|
||
}
|
||
|
||
len -= local_len;
|
||
pos += local_len;
|
||
}
|
||
|
||
ret = 0;
|
||
out:
|
||
return ret;
|
||
}
|
||
|
||
/*
|
||
* ocfs2_write_end() wants to know which parts of the target page it
|
||
* should complete the write on. It's easiest to compute them ahead of
|
||
* time when a more complete view of the write is available.
|
||
*/
|
||
static void ocfs2_set_target_boundaries(struct ocfs2_super *osb,
|
||
struct ocfs2_write_ctxt *wc,
|
||
loff_t pos, unsigned len, int alloc)
|
||
{
|
||
struct ocfs2_write_cluster_desc *desc;
|
||
|
||
wc->w_target_from = pos & (PAGE_SIZE - 1);
|
||
wc->w_target_to = wc->w_target_from + len;
|
||
|
||
if (alloc == 0)
|
||
return;
|
||
|
||
/*
|
||
* Allocating write - we may have different boundaries based
|
||
* on page size and cluster size.
|
||
*
|
||
* NOTE: We can no longer compute one value from the other as
|
||
* the actual write length and user provided length may be
|
||
* different.
|
||
*/
|
||
|
||
if (wc->w_large_pages) {
|
||
/*
|
||
* We only care about the 1st and last cluster within
|
||
* our range and whether they should be zero'd or not. Either
|
||
* value may be extended out to the start/end of a
|
||
* newly allocated cluster.
|
||
*/
|
||
desc = &wc->w_desc[0];
|
||
if (desc->c_needs_zero)
|
||
ocfs2_figure_cluster_boundaries(osb,
|
||
desc->c_cpos,
|
||
&wc->w_target_from,
|
||
NULL);
|
||
|
||
desc = &wc->w_desc[wc->w_clen - 1];
|
||
if (desc->c_needs_zero)
|
||
ocfs2_figure_cluster_boundaries(osb,
|
||
desc->c_cpos,
|
||
NULL,
|
||
&wc->w_target_to);
|
||
} else {
|
||
wc->w_target_from = 0;
|
||
wc->w_target_to = PAGE_SIZE;
|
||
}
|
||
}
|
||
|
||
/*
|
||
* Check if this extent is marked UNWRITTEN by direct io. If so, we need not to
|
||
* do the zero work. And should not to clear UNWRITTEN since it will be cleared
|
||
* by the direct io procedure.
|
||
* If this is a new extent that allocated by direct io, we should mark it in
|
||
* the ip_unwritten_list.
|
||
*/
|
||
static int ocfs2_unwritten_check(struct inode *inode,
|
||
struct ocfs2_write_ctxt *wc,
|
||
struct ocfs2_write_cluster_desc *desc)
|
||
{
|
||
struct ocfs2_inode_info *oi = OCFS2_I(inode);
|
||
struct ocfs2_unwritten_extent *ue = NULL, *new = NULL;
|
||
int ret = 0;
|
||
|
||
if (!desc->c_needs_zero)
|
||
return 0;
|
||
|
||
retry:
|
||
spin_lock(&oi->ip_lock);
|
||
/* Needs not to zero no metter buffer or direct. The one who is zero
|
||
* the cluster is doing zero. And he will clear unwritten after all
|
||
* cluster io finished. */
|
||
list_for_each_entry(ue, &oi->ip_unwritten_list, ue_ip_node) {
|
||
if (desc->c_cpos == ue->ue_cpos) {
|
||
BUG_ON(desc->c_new);
|
||
desc->c_needs_zero = 0;
|
||
desc->c_clear_unwritten = 0;
|
||
goto unlock;
|
||
}
|
||
}
|
||
|
||
if (wc->w_type != OCFS2_WRITE_DIRECT)
|
||
goto unlock;
|
||
|
||
if (new == NULL) {
|
||
spin_unlock(&oi->ip_lock);
|
||
new = kmalloc(sizeof(struct ocfs2_unwritten_extent),
|
||
GFP_NOFS);
|
||
if (new == NULL) {
|
||
ret = -ENOMEM;
|
||
goto out;
|
||
}
|
||
goto retry;
|
||
}
|
||
/* This direct write will doing zero. */
|
||
new->ue_cpos = desc->c_cpos;
|
||
new->ue_phys = desc->c_phys;
|
||
desc->c_clear_unwritten = 0;
|
||
list_add_tail(&new->ue_ip_node, &oi->ip_unwritten_list);
|
||
list_add_tail(&new->ue_node, &wc->w_unwritten_list);
|
||
wc->w_unwritten_count++;
|
||
new = NULL;
|
||
unlock:
|
||
spin_unlock(&oi->ip_lock);
|
||
out:
|
||
kfree(new);
|
||
return ret;
|
||
}
|
||
|
||
/*
|
||
* Populate each single-cluster write descriptor in the write context
|
||
* with information about the i/o to be done.
|
||
*
|
||
* Returns the number of clusters that will have to be allocated, as
|
||
* well as a worst case estimate of the number of extent records that
|
||
* would have to be created during a write to an unwritten region.
|
||
*/
|
||
static int ocfs2_populate_write_desc(struct inode *inode,
|
||
struct ocfs2_write_ctxt *wc,
|
||
unsigned int *clusters_to_alloc,
|
||
unsigned int *extents_to_split)
|
||
{
|
||
int ret;
|
||
struct ocfs2_write_cluster_desc *desc;
|
||
unsigned int num_clusters = 0;
|
||
unsigned int ext_flags = 0;
|
||
u32 phys = 0;
|
||
int i;
|
||
|
||
*clusters_to_alloc = 0;
|
||
*extents_to_split = 0;
|
||
|
||
for (i = 0; i < wc->w_clen; i++) {
|
||
desc = &wc->w_desc[i];
|
||
desc->c_cpos = wc->w_cpos + i;
|
||
|
||
if (num_clusters == 0) {
|
||
/*
|
||
* Need to look up the next extent record.
|
||
*/
|
||
ret = ocfs2_get_clusters(inode, desc->c_cpos, &phys,
|
||
&num_clusters, &ext_flags);
|
||
if (ret) {
|
||
mlog_errno(ret);
|
||
goto out;
|
||
}
|
||
|
||
/* We should already CoW the refcountd extent. */
|
||
BUG_ON(ext_flags & OCFS2_EXT_REFCOUNTED);
|
||
|
||
/*
|
||
* Assume worst case - that we're writing in
|
||
* the middle of the extent.
|
||
*
|
||
* We can assume that the write proceeds from
|
||
* left to right, in which case the extent
|
||
* insert code is smart enough to coalesce the
|
||
* next splits into the previous records created.
|
||
*/
|
||
if (ext_flags & OCFS2_EXT_UNWRITTEN)
|
||
*extents_to_split = *extents_to_split + 2;
|
||
} else if (phys) {
|
||
/*
|
||
* Only increment phys if it doesn't describe
|
||
* a hole.
|
||
*/
|
||
phys++;
|
||
}
|
||
|
||
/*
|
||
* If w_first_new_cpos is < UINT_MAX, we have a non-sparse
|
||
* file that got extended. w_first_new_cpos tells us
|
||
* where the newly allocated clusters are so we can
|
||
* zero them.
|
||
*/
|
||
if (desc->c_cpos >= wc->w_first_new_cpos) {
|
||
BUG_ON(phys == 0);
|
||
desc->c_needs_zero = 1;
|
||
}
|
||
|
||
desc->c_phys = phys;
|
||
if (phys == 0) {
|
||
desc->c_new = 1;
|
||
desc->c_needs_zero = 1;
|
||
desc->c_clear_unwritten = 1;
|
||
*clusters_to_alloc = *clusters_to_alloc + 1;
|
||
}
|
||
|
||
if (ext_flags & OCFS2_EXT_UNWRITTEN) {
|
||
desc->c_clear_unwritten = 1;
|
||
desc->c_needs_zero = 1;
|
||
}
|
||
|
||
ret = ocfs2_unwritten_check(inode, wc, desc);
|
||
if (ret) {
|
||
mlog_errno(ret);
|
||
goto out;
|
||
}
|
||
|
||
num_clusters--;
|
||
}
|
||
|
||
ret = 0;
|
||
out:
|
||
return ret;
|
||
}
|
||
|
||
static int ocfs2_write_begin_inline(struct address_space *mapping,
|
||
struct inode *inode,
|
||
struct ocfs2_write_ctxt *wc)
|
||
{
|
||
int ret;
|
||
struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
|
||
struct page *page;
|
||
handle_t *handle;
|
||
struct ocfs2_dinode *di = (struct ocfs2_dinode *)wc->w_di_bh->b_data;
|
||
|
||
handle = ocfs2_start_trans(osb, OCFS2_INODE_UPDATE_CREDITS);
|
||
if (IS_ERR(handle)) {
|
||
ret = PTR_ERR(handle);
|
||
mlog_errno(ret);
|
||
goto out;
|
||
}
|
||
|
||
page = find_or_create_page(mapping, 0, GFP_NOFS);
|
||
if (!page) {
|
||
ocfs2_commit_trans(osb, handle);
|
||
ret = -ENOMEM;
|
||
mlog_errno(ret);
|
||
goto out;
|
||
}
|
||
/*
|
||
* If we don't set w_num_pages then this page won't get unlocked
|
||
* and freed on cleanup of the write context.
|
||
*/
|
||
wc->w_pages[0] = wc->w_target_page = page;
|
||
wc->w_num_pages = 1;
|
||
|
||
ret = ocfs2_journal_access_di(handle, INODE_CACHE(inode), wc->w_di_bh,
|
||
OCFS2_JOURNAL_ACCESS_WRITE);
|
||
if (ret) {
|
||
ocfs2_commit_trans(osb, handle);
|
||
|
||
mlog_errno(ret);
|
||
goto out;
|
||
}
|
||
|
||
if (!(OCFS2_I(inode)->ip_dyn_features & OCFS2_INLINE_DATA_FL))
|
||
ocfs2_set_inode_data_inline(inode, di);
|
||
|
||
if (!PageUptodate(page)) {
|
||
ret = ocfs2_read_inline_data(inode, page, wc->w_di_bh);
|
||
if (ret) {
|
||
ocfs2_commit_trans(osb, handle);
|
||
|
||
goto out;
|
||
}
|
||
}
|
||
|
||
wc->w_handle = handle;
|
||
out:
|
||
return ret;
|
||
}
|
||
|
||
int ocfs2_size_fits_inline_data(struct buffer_head *di_bh, u64 new_size)
|
||
{
|
||
struct ocfs2_dinode *di = (struct ocfs2_dinode *)di_bh->b_data;
|
||
|
||
if (new_size <= le16_to_cpu(di->id2.i_data.id_count))
|
||
return 1;
|
||
return 0;
|
||
}
|
||
|
||
static int ocfs2_try_to_write_inline_data(struct address_space *mapping,
|
||
struct inode *inode, loff_t pos,
|
||
unsigned len, struct page *mmap_page,
|
||
struct ocfs2_write_ctxt *wc)
|
||
{
|
||
int ret, written = 0;
|
||
loff_t end = pos + len;
|
||
struct ocfs2_inode_info *oi = OCFS2_I(inode);
|
||
struct ocfs2_dinode *di = NULL;
|
||
|
||
trace_ocfs2_try_to_write_inline_data((unsigned long long)oi->ip_blkno,
|
||
len, (unsigned long long)pos,
|
||
oi->ip_dyn_features);
|
||
|
||
/*
|
||
* Handle inodes which already have inline data 1st.
|
||
*/
|
||
if (oi->ip_dyn_features & OCFS2_INLINE_DATA_FL) {
|
||
if (mmap_page == NULL &&
|
||
ocfs2_size_fits_inline_data(wc->w_di_bh, end))
|
||
goto do_inline_write;
|
||
|
||
/*
|
||
* The write won't fit - we have to give this inode an
|
||
* inline extent list now.
|
||
*/
|
||
ret = ocfs2_convert_inline_data_to_extents(inode, wc->w_di_bh);
|
||
if (ret)
|
||
mlog_errno(ret);
|
||
goto out;
|
||
}
|
||
|
||
/*
|
||
* Check whether the inode can accept inline data.
|
||
*/
|
||
if (oi->ip_clusters != 0 || i_size_read(inode) != 0)
|
||
return 0;
|
||
|
||
/*
|
||
* Check whether the write can fit.
|
||
*/
|
||
di = (struct ocfs2_dinode *)wc->w_di_bh->b_data;
|
||
if (mmap_page ||
|
||
end > ocfs2_max_inline_data_with_xattr(inode->i_sb, di))
|
||
return 0;
|
||
|
||
do_inline_write:
|
||
ret = ocfs2_write_begin_inline(mapping, inode, wc);
|
||
if (ret) {
|
||
mlog_errno(ret);
|
||
goto out;
|
||
}
|
||
|
||
/*
|
||
* This signals to the caller that the data can be written
|
||
* inline.
|
||
*/
|
||
written = 1;
|
||
out:
|
||
return written ? written : ret;
|
||
}
|
||
|
||
/*
|
||
* This function only does anything for file systems which can't
|
||
* handle sparse files.
|
||
*
|
||
* What we want to do here is fill in any hole between the current end
|
||
* of allocation and the end of our write. That way the rest of the
|
||
* write path can treat it as an non-allocating write, which has no
|
||
* special case code for sparse/nonsparse files.
|
||
*/
|
||
static int ocfs2_expand_nonsparse_inode(struct inode *inode,
|
||
struct buffer_head *di_bh,
|
||
loff_t pos, unsigned len,
|
||
struct ocfs2_write_ctxt *wc)
|
||
{
|
||
int ret;
|
||
loff_t newsize = pos + len;
|
||
|
||
BUG_ON(ocfs2_sparse_alloc(OCFS2_SB(inode->i_sb)));
|
||
|
||
if (newsize <= i_size_read(inode))
|
||
return 0;
|
||
|
||
ret = ocfs2_extend_no_holes(inode, di_bh, newsize, pos);
|
||
if (ret)
|
||
mlog_errno(ret);
|
||
|
||
/* There is no wc if this is call from direct. */
|
||
if (wc)
|
||
wc->w_first_new_cpos =
|
||
ocfs2_clusters_for_bytes(inode->i_sb, i_size_read(inode));
|
||
|
||
return ret;
|
||
}
|
||
|
||
static int ocfs2_zero_tail(struct inode *inode, struct buffer_head *di_bh,
|
||
loff_t pos)
|
||
{
|
||
int ret = 0;
|
||
|
||
BUG_ON(!ocfs2_sparse_alloc(OCFS2_SB(inode->i_sb)));
|
||
if (pos > i_size_read(inode))
|
||
ret = ocfs2_zero_extend(inode, di_bh, pos);
|
||
|
||
return ret;
|
||
}
|
||
|
||
int ocfs2_write_begin_nolock(struct address_space *mapping,
|
||
loff_t pos, unsigned len, ocfs2_write_type_t type,
|
||
struct page **pagep, void **fsdata,
|
||
struct buffer_head *di_bh, struct page *mmap_page)
|
||
{
|
||
int ret, cluster_of_pages, credits = OCFS2_INODE_UPDATE_CREDITS;
|
||
unsigned int clusters_to_alloc, extents_to_split, clusters_need = 0;
|
||
struct ocfs2_write_ctxt *wc;
|
||
struct inode *inode = mapping->host;
|
||
struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
|
||
struct ocfs2_dinode *di;
|
||
struct ocfs2_alloc_context *data_ac = NULL;
|
||
struct ocfs2_alloc_context *meta_ac = NULL;
|
||
handle_t *handle;
|
||
struct ocfs2_extent_tree et;
|
||
int try_free = 1, ret1;
|
||
|
||
try_again:
|
||
ret = ocfs2_alloc_write_ctxt(&wc, osb, pos, len, type, di_bh);
|
||
if (ret) {
|
||
mlog_errno(ret);
|
||
return ret;
|
||
}
|
||
|
||
if (ocfs2_supports_inline_data(osb)) {
|
||
ret = ocfs2_try_to_write_inline_data(mapping, inode, pos, len,
|
||
mmap_page, wc);
|
||
if (ret == 1) {
|
||
ret = 0;
|
||
goto success;
|
||
}
|
||
if (ret < 0) {
|
||
mlog_errno(ret);
|
||
goto out;
|
||
}
|
||
}
|
||
|
||
/* Direct io change i_size late, should not zero tail here. */
|
||
if (type != OCFS2_WRITE_DIRECT) {
|
||
if (ocfs2_sparse_alloc(osb))
|
||
ret = ocfs2_zero_tail(inode, di_bh, pos);
|
||
else
|
||
ret = ocfs2_expand_nonsparse_inode(inode, di_bh, pos,
|
||
len, wc);
|
||
if (ret) {
|
||
mlog_errno(ret);
|
||
goto out;
|
||
}
|
||
}
|
||
|
||
ret = ocfs2_check_range_for_refcount(inode, pos, len);
|
||
if (ret < 0) {
|
||
mlog_errno(ret);
|
||
goto out;
|
||
} else if (ret == 1) {
|
||
clusters_need = wc->w_clen;
|
||
ret = ocfs2_refcount_cow(inode, di_bh,
|
||
wc->w_cpos, wc->w_clen, UINT_MAX);
|
||
if (ret) {
|
||
mlog_errno(ret);
|
||
goto out;
|
||
}
|
||
}
|
||
|
||
ret = ocfs2_populate_write_desc(inode, wc, &clusters_to_alloc,
|
||
&extents_to_split);
|
||
if (ret) {
|
||
mlog_errno(ret);
|
||
goto out;
|
||
}
|
||
clusters_need += clusters_to_alloc;
|
||
|
||
di = (struct ocfs2_dinode *)wc->w_di_bh->b_data;
|
||
|
||
trace_ocfs2_write_begin_nolock(
|
||
(unsigned long long)OCFS2_I(inode)->ip_blkno,
|
||
(long long)i_size_read(inode),
|
||
le32_to_cpu(di->i_clusters),
|
||
pos, len, type, mmap_page,
|
||
clusters_to_alloc, extents_to_split);
|
||
|
||
/*
|
||
* We set w_target_from, w_target_to here so that
|
||
* ocfs2_write_end() knows which range in the target page to
|
||
* write out. An allocation requires that we write the entire
|
||
* cluster range.
|
||
*/
|
||
if (clusters_to_alloc || extents_to_split) {
|
||
/*
|
||
* XXX: We are stretching the limits of
|
||
* ocfs2_lock_allocators(). It greatly over-estimates
|
||
* the work to be done.
|
||
*/
|
||
ocfs2_init_dinode_extent_tree(&et, INODE_CACHE(inode),
|
||
wc->w_di_bh);
|
||
ret = ocfs2_lock_allocators(inode, &et,
|
||
clusters_to_alloc, extents_to_split,
|
||
&data_ac, &meta_ac);
|
||
if (ret) {
|
||
mlog_errno(ret);
|
||
goto out;
|
||
}
|
||
|
||
if (data_ac)
|
||
data_ac->ac_resv = &OCFS2_I(inode)->ip_la_data_resv;
|
||
|
||
credits = ocfs2_calc_extend_credits(inode->i_sb,
|
||
&di->id2.i_list);
|
||
} else if (type == OCFS2_WRITE_DIRECT)
|
||
/* direct write needs not to start trans if no extents alloc. */
|
||
goto success;
|
||
|
||
/*
|
||
* We have to zero sparse allocated clusters, unwritten extent clusters,
|
||
* and non-sparse clusters we just extended. For non-sparse writes,
|
||
* we know zeros will only be needed in the first and/or last cluster.
|
||
*/
|
||
if (wc->w_clen && (wc->w_desc[0].c_needs_zero ||
|
||
wc->w_desc[wc->w_clen - 1].c_needs_zero))
|
||
cluster_of_pages = 1;
|
||
else
|
||
cluster_of_pages = 0;
|
||
|
||
ocfs2_set_target_boundaries(osb, wc, pos, len, cluster_of_pages);
|
||
|
||
handle = ocfs2_start_trans(osb, credits);
|
||
if (IS_ERR(handle)) {
|
||
ret = PTR_ERR(handle);
|
||
mlog_errno(ret);
|
||
goto out;
|
||
}
|
||
|
||
wc->w_handle = handle;
|
||
|
||
if (clusters_to_alloc) {
|
||
ret = dquot_alloc_space_nodirty(inode,
|
||
ocfs2_clusters_to_bytes(osb->sb, clusters_to_alloc));
|
||
if (ret)
|
||
goto out_commit;
|
||
}
|
||
|
||
ret = ocfs2_journal_access_di(handle, INODE_CACHE(inode), wc->w_di_bh,
|
||
OCFS2_JOURNAL_ACCESS_WRITE);
|
||
if (ret) {
|
||
mlog_errno(ret);
|
||
goto out_quota;
|
||
}
|
||
|
||
/*
|
||
* Fill our page array first. That way we've grabbed enough so
|
||
* that we can zero and flush if we error after adding the
|
||
* extent.
|
||
*/
|
||
ret = ocfs2_grab_pages_for_write(mapping, wc, wc->w_cpos, pos, len,
|
||
cluster_of_pages, mmap_page);
|
||
if (ret) {
|
||
/*
|
||
* ocfs2_grab_pages_for_write() returns -EAGAIN if it could not lock
|
||
* the target page. In this case, we exit with no error and no target
|
||
* page. This will trigger the caller, page_mkwrite(), to re-try
|
||
* the operation.
|
||
*/
|
||
if (type == OCFS2_WRITE_MMAP && ret == -EAGAIN) {
|
||
BUG_ON(wc->w_target_page);
|
||
ret = 0;
|
||
goto out_quota;
|
||
}
|
||
|
||
mlog_errno(ret);
|
||
goto out_quota;
|
||
}
|
||
|
||
ret = ocfs2_write_cluster_by_desc(mapping, data_ac, meta_ac, wc, pos,
|
||
len);
|
||
if (ret) {
|
||
mlog_errno(ret);
|
||
goto out_quota;
|
||
}
|
||
|
||
if (data_ac)
|
||
ocfs2_free_alloc_context(data_ac);
|
||
if (meta_ac)
|
||
ocfs2_free_alloc_context(meta_ac);
|
||
|
||
success:
|
||
if (pagep)
|
||
*pagep = wc->w_target_page;
|
||
*fsdata = wc;
|
||
return 0;
|
||
out_quota:
|
||
if (clusters_to_alloc)
|
||
dquot_free_space(inode,
|
||
ocfs2_clusters_to_bytes(osb->sb, clusters_to_alloc));
|
||
out_commit:
|
||
ocfs2_commit_trans(osb, handle);
|
||
|
||
out:
|
||
/*
|
||
* The mmapped page won't be unlocked in ocfs2_free_write_ctxt(),
|
||
* even in case of error here like ENOSPC and ENOMEM. So, we need
|
||
* to unlock the target page manually to prevent deadlocks when
|
||
* retrying again on ENOSPC, or when returning non-VM_FAULT_LOCKED
|
||
* to VM code.
|
||
*/
|
||
if (wc->w_target_locked)
|
||
unlock_page(mmap_page);
|
||
|
||
ocfs2_free_write_ctxt(inode, wc);
|
||
|
||
if (data_ac) {
|
||
ocfs2_free_alloc_context(data_ac);
|
||
data_ac = NULL;
|
||
}
|
||
if (meta_ac) {
|
||
ocfs2_free_alloc_context(meta_ac);
|
||
meta_ac = NULL;
|
||
}
|
||
|
||
if (ret == -ENOSPC && try_free) {
|
||
/*
|
||
* Try to free some truncate log so that we can have enough
|
||
* clusters to allocate.
|
||
*/
|
||
try_free = 0;
|
||
|
||
ret1 = ocfs2_try_to_free_truncate_log(osb, clusters_need);
|
||
if (ret1 == 1)
|
||
goto try_again;
|
||
|
||
if (ret1 < 0)
|
||
mlog_errno(ret1);
|
||
}
|
||
|
||
return ret;
|
||
}
|
||
|
||
static int ocfs2_write_begin(struct file *file, struct address_space *mapping,
|
||
loff_t pos, unsigned len,
|
||
struct page **pagep, void **fsdata)
|
||
{
|
||
int ret;
|
||
struct buffer_head *di_bh = NULL;
|
||
struct inode *inode = mapping->host;
|
||
|
||
ret = ocfs2_inode_lock(inode, &di_bh, 1);
|
||
if (ret) {
|
||
mlog_errno(ret);
|
||
return ret;
|
||
}
|
||
|
||
/*
|
||
* Take alloc sem here to prevent concurrent lookups. That way
|
||
* the mapping, zeroing and tree manipulation within
|
||
* ocfs2_write() will be safe against ->read_folio(). This
|
||
* should also serve to lock out allocation from a shared
|
||
* writeable region.
|
||
*/
|
||
down_write(&OCFS2_I(inode)->ip_alloc_sem);
|
||
|
||
ret = ocfs2_write_begin_nolock(mapping, pos, len, OCFS2_WRITE_BUFFER,
|
||
pagep, fsdata, di_bh, NULL);
|
||
if (ret) {
|
||
mlog_errno(ret);
|
||
goto out_fail;
|
||
}
|
||
|
||
brelse(di_bh);
|
||
|
||
return 0;
|
||
|
||
out_fail:
|
||
up_write(&OCFS2_I(inode)->ip_alloc_sem);
|
||
|
||
brelse(di_bh);
|
||
ocfs2_inode_unlock(inode, 1);
|
||
|
||
return ret;
|
||
}
|
||
|
||
static void ocfs2_write_end_inline(struct inode *inode, loff_t pos,
|
||
unsigned len, unsigned *copied,
|
||
struct ocfs2_dinode *di,
|
||
struct ocfs2_write_ctxt *wc)
|
||
{
|
||
void *kaddr;
|
||
|
||
if (unlikely(*copied < len)) {
|
||
if (!PageUptodate(wc->w_target_page)) {
|
||
*copied = 0;
|
||
return;
|
||
}
|
||
}
|
||
|
||
kaddr = kmap_atomic(wc->w_target_page);
|
||
memcpy(di->id2.i_data.id_data + pos, kaddr + pos, *copied);
|
||
kunmap_atomic(kaddr);
|
||
|
||
trace_ocfs2_write_end_inline(
|
||
(unsigned long long)OCFS2_I(inode)->ip_blkno,
|
||
(unsigned long long)pos, *copied,
|
||
le16_to_cpu(di->id2.i_data.id_count),
|
||
le16_to_cpu(di->i_dyn_features));
|
||
}
|
||
|
||
int ocfs2_write_end_nolock(struct address_space *mapping,
|
||
loff_t pos, unsigned len, unsigned copied, void *fsdata)
|
||
{
|
||
int i, ret;
|
||
unsigned from, to, start = pos & (PAGE_SIZE - 1);
|
||
struct inode *inode = mapping->host;
|
||
struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
|
||
struct ocfs2_write_ctxt *wc = fsdata;
|
||
struct ocfs2_dinode *di = (struct ocfs2_dinode *)wc->w_di_bh->b_data;
|
||
handle_t *handle = wc->w_handle;
|
||
struct page *tmppage;
|
||
|
||
BUG_ON(!list_empty(&wc->w_unwritten_list));
|
||
|
||
if (handle) {
|
||
ret = ocfs2_journal_access_di(handle, INODE_CACHE(inode),
|
||
wc->w_di_bh, OCFS2_JOURNAL_ACCESS_WRITE);
|
||
if (ret) {
|
||
copied = ret;
|
||
mlog_errno(ret);
|
||
goto out;
|
||
}
|
||
}
|
||
|
||
if (OCFS2_I(inode)->ip_dyn_features & OCFS2_INLINE_DATA_FL) {
|
||
ocfs2_write_end_inline(inode, pos, len, &copied, di, wc);
|
||
goto out_write_size;
|
||
}
|
||
|
||
if (unlikely(copied < len) && wc->w_target_page) {
|
||
loff_t new_isize;
|
||
|
||
if (!PageUptodate(wc->w_target_page))
|
||
copied = 0;
|
||
|
||
new_isize = max_t(loff_t, i_size_read(inode), pos + copied);
|
||
if (new_isize > page_offset(wc->w_target_page))
|
||
ocfs2_zero_new_buffers(wc->w_target_page, start+copied,
|
||
start+len);
|
||
else {
|
||
/*
|
||
* When page is fully beyond new isize (data copy
|
||
* failed), do not bother zeroing the page. Invalidate
|
||
* it instead so that writeback does not get confused
|
||
* put page & buffer dirty bits into inconsistent
|
||
* state.
|
||
*/
|
||
block_invalidate_folio(page_folio(wc->w_target_page),
|
||
0, PAGE_SIZE);
|
||
}
|
||
}
|
||
if (wc->w_target_page)
|
||
flush_dcache_page(wc->w_target_page);
|
||
|
||
for(i = 0; i < wc->w_num_pages; i++) {
|
||
tmppage = wc->w_pages[i];
|
||
|
||
/* This is the direct io target page. */
|
||
if (tmppage == NULL)
|
||
continue;
|
||
|
||
if (tmppage == wc->w_target_page) {
|
||
from = wc->w_target_from;
|
||
to = wc->w_target_to;
|
||
|
||
BUG_ON(from > PAGE_SIZE ||
|
||
to > PAGE_SIZE ||
|
||
to < from);
|
||
} else {
|
||
/*
|
||
* Pages adjacent to the target (if any) imply
|
||
* a hole-filling write in which case we want
|
||
* to flush their entire range.
|
||
*/
|
||
from = 0;
|
||
to = PAGE_SIZE;
|
||
}
|
||
|
||
if (page_has_buffers(tmppage)) {
|
||
if (handle && ocfs2_should_order_data(inode)) {
|
||
loff_t start_byte =
|
||
((loff_t)tmppage->index << PAGE_SHIFT) +
|
||
from;
|
||
loff_t length = to - from;
|
||
ocfs2_jbd2_inode_add_write(handle, inode,
|
||
start_byte, length);
|
||
}
|
||
block_commit_write(tmppage, from, to);
|
||
}
|
||
}
|
||
|
||
out_write_size:
|
||
/* Direct io do not update i_size here. */
|
||
if (wc->w_type != OCFS2_WRITE_DIRECT) {
|
||
pos += copied;
|
||
if (pos > i_size_read(inode)) {
|
||
i_size_write(inode, pos);
|
||
mark_inode_dirty(inode);
|
||
}
|
||
inode->i_blocks = ocfs2_inode_sector_count(inode);
|
||
di->i_size = cpu_to_le64((u64)i_size_read(inode));
|
||
inode_set_mtime_to_ts(inode, inode_set_ctime_current(inode));
|
||
di->i_mtime = di->i_ctime = cpu_to_le64(inode_get_mtime_sec(inode));
|
||
di->i_mtime_nsec = di->i_ctime_nsec = cpu_to_le32(inode_get_mtime_nsec(inode));
|
||
if (handle)
|
||
ocfs2_update_inode_fsync_trans(handle, inode, 1);
|
||
}
|
||
if (handle)
|
||
ocfs2_journal_dirty(handle, wc->w_di_bh);
|
||
|
||
out:
|
||
/* unlock pages before dealloc since it needs acquiring j_trans_barrier
|
||
* lock, or it will cause a deadlock since journal commit threads holds
|
||
* this lock and will ask for the page lock when flushing the data.
|
||
* put it here to preserve the unlock order.
|
||
*/
|
||
ocfs2_unlock_pages(wc);
|
||
|
||
if (handle)
|
||
ocfs2_commit_trans(osb, handle);
|
||
|
||
ocfs2_run_deallocs(osb, &wc->w_dealloc);
|
||
|
||
brelse(wc->w_di_bh);
|
||
kfree(wc);
|
||
|
||
return copied;
|
||
}
|
||
|
||
static int ocfs2_write_end(struct file *file, struct address_space *mapping,
|
||
loff_t pos, unsigned len, unsigned copied,
|
||
struct page *page, void *fsdata)
|
||
{
|
||
int ret;
|
||
struct inode *inode = mapping->host;
|
||
|
||
ret = ocfs2_write_end_nolock(mapping, pos, len, copied, fsdata);
|
||
|
||
up_write(&OCFS2_I(inode)->ip_alloc_sem);
|
||
ocfs2_inode_unlock(inode, 1);
|
||
|
||
return ret;
|
||
}
|
||
|
||
struct ocfs2_dio_write_ctxt {
|
||
struct list_head dw_zero_list;
|
||
unsigned dw_zero_count;
|
||
int dw_orphaned;
|
||
pid_t dw_writer_pid;
|
||
};
|
||
|
||
static struct ocfs2_dio_write_ctxt *
|
||
ocfs2_dio_alloc_write_ctx(struct buffer_head *bh, int *alloc)
|
||
{
|
||
struct ocfs2_dio_write_ctxt *dwc = NULL;
|
||
|
||
if (bh->b_private)
|
||
return bh->b_private;
|
||
|
||
dwc = kmalloc(sizeof(struct ocfs2_dio_write_ctxt), GFP_NOFS);
|
||
if (dwc == NULL)
|
||
return NULL;
|
||
INIT_LIST_HEAD(&dwc->dw_zero_list);
|
||
dwc->dw_zero_count = 0;
|
||
dwc->dw_orphaned = 0;
|
||
dwc->dw_writer_pid = task_pid_nr(current);
|
||
bh->b_private = dwc;
|
||
*alloc = 1;
|
||
|
||
return dwc;
|
||
}
|
||
|
||
static void ocfs2_dio_free_write_ctx(struct inode *inode,
|
||
struct ocfs2_dio_write_ctxt *dwc)
|
||
{
|
||
ocfs2_free_unwritten_list(inode, &dwc->dw_zero_list);
|
||
kfree(dwc);
|
||
}
|
||
|
||
/*
|
||
* TODO: Make this into a generic get_blocks function.
|
||
*
|
||
* From do_direct_io in direct-io.c:
|
||
* "So what we do is to permit the ->get_blocks function to populate
|
||
* bh.b_size with the size of IO which is permitted at this offset and
|
||
* this i_blkbits."
|
||
*
|
||
* This function is called directly from get_more_blocks in direct-io.c.
|
||
*
|
||
* called like this: dio->get_blocks(dio->inode, fs_startblk,
|
||
* fs_count, map_bh, dio->rw == WRITE);
|
||
*/
|
||
static int ocfs2_dio_wr_get_block(struct inode *inode, sector_t iblock,
|
||
struct buffer_head *bh_result, int create)
|
||
{
|
||
struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
|
||
struct ocfs2_inode_info *oi = OCFS2_I(inode);
|
||
struct ocfs2_write_ctxt *wc;
|
||
struct ocfs2_write_cluster_desc *desc = NULL;
|
||
struct ocfs2_dio_write_ctxt *dwc = NULL;
|
||
struct buffer_head *di_bh = NULL;
|
||
u64 p_blkno;
|
||
unsigned int i_blkbits = inode->i_sb->s_blocksize_bits;
|
||
loff_t pos = iblock << i_blkbits;
|
||
sector_t endblk = (i_size_read(inode) - 1) >> i_blkbits;
|
||
unsigned len, total_len = bh_result->b_size;
|
||
int ret = 0, first_get_block = 0;
|
||
|
||
len = osb->s_clustersize - (pos & (osb->s_clustersize - 1));
|
||
len = min(total_len, len);
|
||
|
||
/*
|
||
* bh_result->b_size is count in get_more_blocks according to write
|
||
* "pos" and "end", we need map twice to return different buffer state:
|
||
* 1. area in file size, not set NEW;
|
||
* 2. area out file size, set NEW.
|
||
*
|
||
* iblock endblk
|
||
* |--------|---------|---------|---------
|
||
* |<-------area in file------->|
|
||
*/
|
||
|
||
if ((iblock <= endblk) &&
|
||
((iblock + ((len - 1) >> i_blkbits)) > endblk))
|
||
len = (endblk - iblock + 1) << i_blkbits;
|
||
|
||
mlog(0, "get block of %lu at %llu:%u req %u\n",
|
||
inode->i_ino, pos, len, total_len);
|
||
|
||
/*
|
||
* Because we need to change file size in ocfs2_dio_end_io_write(), or
|
||
* we may need to add it to orphan dir. So can not fall to fast path
|
||
* while file size will be changed.
|
||
*/
|
||
if (pos + total_len <= i_size_read(inode)) {
|
||
|
||
/* This is the fast path for re-write. */
|
||
ret = ocfs2_lock_get_block(inode, iblock, bh_result, create);
|
||
if (buffer_mapped(bh_result) &&
|
||
!buffer_new(bh_result) &&
|
||
ret == 0)
|
||
goto out;
|
||
|
||
/* Clear state set by ocfs2_get_block. */
|
||
bh_result->b_state = 0;
|
||
}
|
||
|
||
dwc = ocfs2_dio_alloc_write_ctx(bh_result, &first_get_block);
|
||
if (unlikely(dwc == NULL)) {
|
||
ret = -ENOMEM;
|
||
mlog_errno(ret);
|
||
goto out;
|
||
}
|
||
|
||
if (ocfs2_clusters_for_bytes(inode->i_sb, pos + total_len) >
|
||
ocfs2_clusters_for_bytes(inode->i_sb, i_size_read(inode)) &&
|
||
!dwc->dw_orphaned) {
|
||
/*
|
||
* when we are going to alloc extents beyond file size, add the
|
||
* inode to orphan dir, so we can recall those spaces when
|
||
* system crashed during write.
|
||
*/
|
||
ret = ocfs2_add_inode_to_orphan(osb, inode);
|
||
if (ret < 0) {
|
||
mlog_errno(ret);
|
||
goto out;
|
||
}
|
||
dwc->dw_orphaned = 1;
|
||
}
|
||
|
||
ret = ocfs2_inode_lock(inode, &di_bh, 1);
|
||
if (ret) {
|
||
mlog_errno(ret);
|
||
goto out;
|
||
}
|
||
|
||
down_write(&oi->ip_alloc_sem);
|
||
|
||
if (first_get_block) {
|
||
if (ocfs2_sparse_alloc(osb))
|
||
ret = ocfs2_zero_tail(inode, di_bh, pos);
|
||
else
|
||
ret = ocfs2_expand_nonsparse_inode(inode, di_bh, pos,
|
||
total_len, NULL);
|
||
if (ret < 0) {
|
||
mlog_errno(ret);
|
||
goto unlock;
|
||
}
|
||
}
|
||
|
||
ret = ocfs2_write_begin_nolock(inode->i_mapping, pos, len,
|
||
OCFS2_WRITE_DIRECT, NULL,
|
||
(void **)&wc, di_bh, NULL);
|
||
if (ret) {
|
||
mlog_errno(ret);
|
||
goto unlock;
|
||
}
|
||
|
||
desc = &wc->w_desc[0];
|
||
|
||
p_blkno = ocfs2_clusters_to_blocks(inode->i_sb, desc->c_phys);
|
||
BUG_ON(p_blkno == 0);
|
||
p_blkno += iblock & (u64)(ocfs2_clusters_to_blocks(inode->i_sb, 1) - 1);
|
||
|
||
map_bh(bh_result, inode->i_sb, p_blkno);
|
||
bh_result->b_size = len;
|
||
if (desc->c_needs_zero)
|
||
set_buffer_new(bh_result);
|
||
|
||
if (iblock > endblk)
|
||
set_buffer_new(bh_result);
|
||
|
||
/* May sleep in end_io. It should not happen in a irq context. So defer
|
||
* it to dio work queue. */
|
||
set_buffer_defer_completion(bh_result);
|
||
|
||
if (!list_empty(&wc->w_unwritten_list)) {
|
||
struct ocfs2_unwritten_extent *ue = NULL;
|
||
|
||
ue = list_first_entry(&wc->w_unwritten_list,
|
||
struct ocfs2_unwritten_extent,
|
||
ue_node);
|
||
BUG_ON(ue->ue_cpos != desc->c_cpos);
|
||
/* The physical address may be 0, fill it. */
|
||
ue->ue_phys = desc->c_phys;
|
||
|
||
list_splice_tail_init(&wc->w_unwritten_list, &dwc->dw_zero_list);
|
||
dwc->dw_zero_count += wc->w_unwritten_count;
|
||
}
|
||
|
||
ret = ocfs2_write_end_nolock(inode->i_mapping, pos, len, len, wc);
|
||
BUG_ON(ret != len);
|
||
ret = 0;
|
||
unlock:
|
||
up_write(&oi->ip_alloc_sem);
|
||
ocfs2_inode_unlock(inode, 1);
|
||
brelse(di_bh);
|
||
out:
|
||
if (ret < 0)
|
||
ret = -EIO;
|
||
return ret;
|
||
}
|
||
|
||
static int ocfs2_dio_end_io_write(struct inode *inode,
|
||
struct ocfs2_dio_write_ctxt *dwc,
|
||
loff_t offset,
|
||
ssize_t bytes)
|
||
{
|
||
struct ocfs2_cached_dealloc_ctxt dealloc;
|
||
struct ocfs2_extent_tree et;
|
||
struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
|
||
struct ocfs2_inode_info *oi = OCFS2_I(inode);
|
||
struct ocfs2_unwritten_extent *ue = NULL;
|
||
struct buffer_head *di_bh = NULL;
|
||
struct ocfs2_dinode *di;
|
||
struct ocfs2_alloc_context *data_ac = NULL;
|
||
struct ocfs2_alloc_context *meta_ac = NULL;
|
||
handle_t *handle = NULL;
|
||
loff_t end = offset + bytes;
|
||
int ret = 0, credits = 0;
|
||
|
||
ocfs2_init_dealloc_ctxt(&dealloc);
|
||
|
||
/* We do clear unwritten, delete orphan, change i_size here. If neither
|
||
* of these happen, we can skip all this. */
|
||
if (list_empty(&dwc->dw_zero_list) &&
|
||
end <= i_size_read(inode) &&
|
||
!dwc->dw_orphaned)
|
||
goto out;
|
||
|
||
ret = ocfs2_inode_lock(inode, &di_bh, 1);
|
||
if (ret < 0) {
|
||
mlog_errno(ret);
|
||
goto out;
|
||
}
|
||
|
||
down_write(&oi->ip_alloc_sem);
|
||
|
||
/* Delete orphan before acquire i_rwsem. */
|
||
if (dwc->dw_orphaned) {
|
||
BUG_ON(dwc->dw_writer_pid != task_pid_nr(current));
|
||
|
||
end = end > i_size_read(inode) ? end : 0;
|
||
|
||
ret = ocfs2_del_inode_from_orphan(osb, inode, di_bh,
|
||
!!end, end);
|
||
if (ret < 0)
|
||
mlog_errno(ret);
|
||
}
|
||
|
||
di = (struct ocfs2_dinode *)di_bh->b_data;
|
||
|
||
ocfs2_init_dinode_extent_tree(&et, INODE_CACHE(inode), di_bh);
|
||
|
||
/* Attach dealloc with extent tree in case that we may reuse extents
|
||
* which are already unlinked from current extent tree due to extent
|
||
* rotation and merging.
|
||
*/
|
||
et.et_dealloc = &dealloc;
|
||
|
||
ret = ocfs2_lock_allocators(inode, &et, 0, dwc->dw_zero_count*2,
|
||
&data_ac, &meta_ac);
|
||
if (ret) {
|
||
mlog_errno(ret);
|
||
goto unlock;
|
||
}
|
||
|
||
credits = ocfs2_calc_extend_credits(inode->i_sb, &di->id2.i_list);
|
||
|
||
handle = ocfs2_start_trans(osb, credits);
|
||
if (IS_ERR(handle)) {
|
||
ret = PTR_ERR(handle);
|
||
mlog_errno(ret);
|
||
goto unlock;
|
||
}
|
||
ret = ocfs2_journal_access_di(handle, INODE_CACHE(inode), di_bh,
|
||
OCFS2_JOURNAL_ACCESS_WRITE);
|
||
if (ret) {
|
||
mlog_errno(ret);
|
||
goto commit;
|
||
}
|
||
|
||
list_for_each_entry(ue, &dwc->dw_zero_list, ue_node) {
|
||
ret = ocfs2_mark_extent_written(inode, &et, handle,
|
||
ue->ue_cpos, 1,
|
||
ue->ue_phys,
|
||
meta_ac, &dealloc);
|
||
if (ret < 0) {
|
||
mlog_errno(ret);
|
||
break;
|
||
}
|
||
}
|
||
|
||
if (end > i_size_read(inode)) {
|
||
ret = ocfs2_set_inode_size(handle, inode, di_bh, end);
|
||
if (ret < 0)
|
||
mlog_errno(ret);
|
||
}
|
||
commit:
|
||
ocfs2_commit_trans(osb, handle);
|
||
unlock:
|
||
up_write(&oi->ip_alloc_sem);
|
||
ocfs2_inode_unlock(inode, 1);
|
||
brelse(di_bh);
|
||
out:
|
||
if (data_ac)
|
||
ocfs2_free_alloc_context(data_ac);
|
||
if (meta_ac)
|
||
ocfs2_free_alloc_context(meta_ac);
|
||
ocfs2_run_deallocs(osb, &dealloc);
|
||
ocfs2_dio_free_write_ctx(inode, dwc);
|
||
|
||
return ret;
|
||
}
|
||
|
||
/*
|
||
* ocfs2_dio_end_io is called by the dio core when a dio is finished. We're
|
||
* particularly interested in the aio/dio case. We use the rw_lock DLM lock
|
||
* to protect io on one node from truncation on another.
|
||
*/
|
||
static int ocfs2_dio_end_io(struct kiocb *iocb,
|
||
loff_t offset,
|
||
ssize_t bytes,
|
||
void *private)
|
||
{
|
||
struct inode *inode = file_inode(iocb->ki_filp);
|
||
int level;
|
||
int ret = 0;
|
||
|
||
/* this io's submitter should not have unlocked this before we could */
|
||
BUG_ON(!ocfs2_iocb_is_rw_locked(iocb));
|
||
|
||
if (bytes <= 0)
|
||
mlog_ratelimited(ML_ERROR, "Direct IO failed, bytes = %lld",
|
||
(long long)bytes);
|
||
if (private) {
|
||
if (bytes > 0)
|
||
ret = ocfs2_dio_end_io_write(inode, private, offset,
|
||
bytes);
|
||
else
|
||
ocfs2_dio_free_write_ctx(inode, private);
|
||
}
|
||
|
||
ocfs2_iocb_clear_rw_locked(iocb);
|
||
|
||
level = ocfs2_iocb_rw_locked_level(iocb);
|
||
ocfs2_rw_unlock(inode, level);
|
||
return ret;
|
||
}
|
||
|
||
static ssize_t ocfs2_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
|
||
{
|
||
struct file *file = iocb->ki_filp;
|
||
struct inode *inode = file->f_mapping->host;
|
||
struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
|
||
get_block_t *get_block;
|
||
|
||
/*
|
||
* Fallback to buffered I/O if we see an inode without
|
||
* extents.
|
||
*/
|
||
if (OCFS2_I(inode)->ip_dyn_features & OCFS2_INLINE_DATA_FL)
|
||
return 0;
|
||
|
||
/* Fallback to buffered I/O if we do not support append dio. */
|
||
if (iocb->ki_pos + iter->count > i_size_read(inode) &&
|
||
!ocfs2_supports_append_dio(osb))
|
||
return 0;
|
||
|
||
if (iov_iter_rw(iter) == READ)
|
||
get_block = ocfs2_lock_get_block;
|
||
else
|
||
get_block = ocfs2_dio_wr_get_block;
|
||
|
||
return __blockdev_direct_IO(iocb, inode, inode->i_sb->s_bdev,
|
||
iter, get_block,
|
||
ocfs2_dio_end_io, 0);
|
||
}
|
||
|
||
const struct address_space_operations ocfs2_aops = {
|
||
.dirty_folio = block_dirty_folio,
|
||
.read_folio = ocfs2_read_folio,
|
||
.readahead = ocfs2_readahead,
|
||
.writepage = ocfs2_writepage,
|
||
.write_begin = ocfs2_write_begin,
|
||
.write_end = ocfs2_write_end,
|
||
.bmap = ocfs2_bmap,
|
||
.direct_IO = ocfs2_direct_IO,
|
||
.invalidate_folio = block_invalidate_folio,
|
||
.release_folio = ocfs2_release_folio,
|
||
.migrate_folio = buffer_migrate_folio,
|
||
.is_partially_uptodate = block_is_partially_uptodate,
|
||
.error_remove_page = generic_error_remove_page,
|
||
};
|