\input texinfo
@c %**start of header
@setfilename textutils.info
@settitle GNU text utilities
@c %**end of header

@include version.texi

@c Define new indices.
@defcodeindex op

@c Put everything in one index (arbitrarily chosen to be the concept index).
@syncodeindex fn cp
@syncodeindex ky cp
@syncodeindex op cp
@syncodeindex pg cp
@syncodeindex vr cp

@ifinfo
@format
START-INFO-DIR-ENTRY
* Text utilities: (textutils). GNU text utilities.
* cat: (textutils)cat invocation. Concatenate and write files.
* cksum: (textutils)cksum invocation. Print @sc{POSIX} CRC checksum.
* comm: (textutils)comm invocation. Compare sorted files by line.
* csplit: (textutils)csplit invocation. Split by context.
* cut: (textutils)cut invocation. Print selected parts of lines.
* expand: (textutils)expand invocation. Convert tabs to spaces.
* fmt: (textutils)fmt invocation. Reformat paragraph text.
* fold: (textutils)fold invocation. Wrap long input lines.
* head: (textutils)head invocation. Output the first part of files.
* join: (textutils)join invocation. Join lines on a common field.
* md5sum: (textutils)md5sum invocation. Print or check message-digests.
* nl: (textutils)nl invocation. Number lines and write files.
* od: (textutils)od invocation. Dump files in octal, etc.
* paste: (textutils)paste invocation. Merge lines of files.
* pr: (textutils)pr invocation. Paginate or columnate files.
* ptx: (textutils)ptx invocation. Produce permuted indexes.
* sort: (textutils)sort invocation. Sort text files.
* split: (textutils)split invocation. Split into fixed-size pieces.
* sum: (textutils)sum invocation. Print traditional checksum.
* tac: (textutils)tac invocation. Reverse files.
* tail: (textutils)tail invocation. Output the last part of files.
* tr: (textutils)tr invocation. Translate characters.
* unexpand: (textutils)unexpand invocation. Convert spaces to tabs.
* uniq: (textutils)uniq invocation. Uniqify files.
* wc: (textutils)wc invocation. Byte, word, and line counts.
END-INFO-DIR-ENTRY
@end format
@end ifinfo

@ifinfo
This file documents the GNU text utilities.

Copyright (C) 1994, 95, 96 Free Software Foundation, Inc.

Permission is granted to make and distribute verbatim copies of
this manual provided the copyright notice and this permission notice
are preserved on all copies.

@ignore
Permission is granted to process this file through TeX and print the
results, provided the printed document carries copying permission
notice identical to this one except for the removal of this paragraph
(this paragraph not being relevant to the printed manual).

@end ignore
Permission is granted to copy and distribute modified versions of this
manual under the conditions for verbatim copying, provided that the entire
resulting derived work is distributed under the terms of a permission
notice identical to this one.

Permission is granted to copy and distribute translations of this manual
into another language, under the above conditions for modified versions,
except that this permission notice may be stated in a translation approved
by the Foundation.
@end ifinfo

@titlepage
@title GNU @code{textutils}
@subtitle A set of text utilities
@subtitle for version @value{VERSION}, @value{UPDATED}
@author David MacKenzie et al.

@page
@vskip 0pt plus 1filll
Copyright @copyright{} 1994, 95, 96 Free Software Foundation, Inc.

Permission is granted to make and distribute verbatim copies of
this manual provided the copyright notice and this permission notice
are preserved on all copies.

Permission is granted to copy and distribute modified versions of this
manual under the conditions for verbatim copying, provided that the entire
resulting derived work is distributed under the terms of a permission
notice identical to this one.

Permission is granted to copy and distribute translations of this manual
into another language, under the above conditions for modified versions,
except that this permission notice may be stated in a translation approved
by the Foundation.
@end titlepage


@ifinfo
@node Top
@top GNU text utilities

@cindex text utilities
@cindex utilities for text handling

This manual documents version @value{VERSION} of the GNU text utilities.

@menu
* Introduction:: Caveats, overview, and authors.
* Common options:: Common options.
* Output of entire files:: cat tac nl od
* Formatting file contents:: fmt pr fold
* Output of parts of files:: head tail split csplit
* Summarizing files:: wc sum cksum md5sum
* Operating on sorted files:: sort uniq comm ptx
* Operating on fields within a line:: cut paste join
* Operating on characters:: tr expand unexpand
* Opening the software toolbox:: The software tools philosophy.
* Index:: General index.

@detailmenu
--- The Detailed Node Listing ---

Output of entire files

* cat invocation:: Concatenate and write files.
* tac invocation:: Concatenate and write files in reverse.
* nl invocation:: Number lines and write files.
* od invocation:: Write files in octal or other formats.

Formatting file contents

* fmt invocation:: Reformat paragraph text.
* pr invocation:: Paginate or columnate files for printing.
* fold invocation:: Wrap input lines to fit in specified width.

Output of parts of files

* head invocation:: Output the first part of files.
* tail invocation:: Output the last part of files.
* split invocation:: Split a file into fixed-size pieces.
* csplit invocation:: Split a file into context-determined pieces.

Summarizing files

* wc invocation:: Print byte, word, and line counts.
* sum invocation:: Print checksum and block counts.
* cksum invocation:: Print CRC checksum and byte counts.
* md5sum invocation:: Print or check message-digests.

Operating on sorted files

* sort invocation:: Sort text files.
* uniq invocation:: Uniqify files.
* comm invocation:: Compare two sorted files line by line.
* ptx invocation:: Produce a permuted index of file contents.

@code{ptx}: Produce permuted indexes

* General options in ptx:: Options which affect general program behaviour.
* Charset selection in ptx:: Underlying character set considerations.
* Input processing in ptx:: Input fields, contexts, and keyword selection.
* Output formatting in ptx:: Types of output format, and sizing the fields.
* Compatibility in ptx:: The GNU extensions to @code{ptx}

Operating on fields within a line

* cut invocation:: Print selected parts of lines.
* paste invocation:: Merge lines of files.
* join invocation:: Join lines on a common field.

Operating on characters

* tr invocation:: Translate, squeeze, and/or delete characters.
* expand invocation:: Convert tabs to spaces.
* unexpand invocation:: Convert spaces to tabs.

@code{tr}: Translate, squeeze, and/or delete characters

* Character sets:: Specifying sets of characters.
* Translating:: Changing one character to another.
* Squeezing:: Squeezing repeats and deleting.
* Warnings in tr:: Warning messages.

Opening the software toolbox

* Toolbox introduction:: Toolbox introduction
* I/O redirection:: I/O redirection
* The who command:: The @code{who} command
* The cut command:: The @code{cut} command
* The sort command:: The @code{sort} command
* The uniq command:: The @code{uniq} command
* Putting the tools together:: Putting the tools together

@end detailmenu
@end menu

@end ifinfo


@node Introduction
@chapter Introduction

@cindex introduction

This manual is incomplete: No attempt is made to explain basic concepts
in a way suitable for novices. Thus, if you are interested, please get
involved in improving this manual. The entire GNU community will
benefit.

@cindex POSIX.2
The GNU text utilities are mostly compatible with the @sc{POSIX.2} standard.

@c This paragraph appears in all of fileutils.texi, textutils.texi, and
@c sh-utils.texi too -- so be sure to keep them consistent.
@cindex bugs, reporting
Please report bugs to @email{bug-textutils@@gnu.org}. Remember
to include the version number, machine architecture, input files, and
any other information needed to reproduce the bug: your input, what you
expected, what you got, and why it is wrong. Diffs are welcome, but
please include a description of the problem as well, since this is
sometimes difficult to infer. @xref{Bugs, , , gcc, GNU CC}.

This manual was originally derived from the Unix man pages in the
distribution, which were written by David MacKenzie and updated by Jim
Meyering. What you are reading now is the authoritative documentation
for these utilities; the man pages are no longer being maintained.
The original @code{fmt} man page was written by Ross Paterson.
Fran@,{c}ois Pinard did the initial conversion to Texinfo format.
Karl Berry did the indexing, some reorganization, and editing of the results.
Richard Stallman contributed his usual invaluable insights to the
overall process.


@node Common options
@chapter Common options

@cindex common options

Certain options are available in all these programs. Rather than
writing identical descriptions for each of the programs, they are
described here. (In fact, every GNU program accepts (or should accept)
these options.)

A few of these programs take arbitrary strings as arguments. In those
cases, @samp{--help} and @samp{--version} are taken as these options
only if there is exactly one command line argument.

@table @samp

@item --help
@opindex --help
@cindex help, online
Print a usage message listing all available options, then exit successfully.

@item --version
@opindex --version
@cindex version number, finding
Print the version number, then exit successfully.

@end table
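
As a quick check of the two options above, the following sketch runs
@code{cat} with each one and records the exit status. It assumes that a
GNU @code{cat} is first on the search path; the variable names are only
illustrative.

```shell
# Both options should write to standard output and exit with status 0.
cat --version >/dev/null
version_status=$?

cat --help >/dev/null
help_status=$?

echo "--version: $version_status, --help: $help_status"
```

Any of the other programs in this manual could be substituted for
@code{cat} here.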


@node Output of entire files
@chapter Output of entire files

@cindex output of entire files
@cindex entire files, output of

These commands read and write entire files, possibly transforming them
in some way.

@menu
* cat invocation:: Concatenate and write files.
* tac invocation:: Concatenate and write files in reverse.
* nl invocation:: Number lines and write files.
* od invocation:: Write files in octal or other formats.
@end menu

@node cat invocation
@section @code{cat}: Concatenate and write files

@pindex cat
@cindex concatenate and write files
@cindex copying files

@code{cat} copies each @var{file} (@samp{-} means standard input), or
standard input if none are given, to standard output. Synopsis:

@example
cat [@var{option}] [@var{file}]@dots{}
@end example

The program accepts the following options. Also see @ref{Common options}.

@table @samp

@item -A
@itemx --show-all
@opindex -A
@opindex --show-all
Equivalent to @samp{-vET}.

@item -B
@itemx --binary
@opindex -B
@opindex --binary
@cindex binary and text I/O in cat
On MS-DOS and MS-Windows only, causes @code{cat} to read and write the
files in binary mode. By default, @code{cat} on MS-DOS/MS-Windows uses
binary mode only when standard output is redirected to a file or a pipe;
this option overrides that. Binary file I/O is used so that the files
retain their format (Unix text as opposed to DOS text and binary),
because @code{cat} is frequently used as a file copying program. Some
options (see below) cause @code{cat} to read and write files in text mode
because then the original file contents aren't important (e.g., when
lines are numbered by @code{cat}, or when line endings should be
marked). This is so these options work as DOS/Windows users would
expect; for example, DOS-style text files have their lines end with a
@key{CR-LF} pair of characters, which won't be processed as an empty line
by @samp{-b} unless the file is read in text mode.

@item -b
@itemx --number-nonblank
@opindex -b
@opindex --number-nonblank
Number all nonblank output lines, starting with 1. On MS-DOS and
MS-Windows, this option causes @code{cat} to read and write files in
text mode.

@item -e
@opindex -e
Equivalent to @samp{-vE}.

@item -E
@itemx --show-ends
@opindex -E
@opindex --show-ends
Display a @samp{$} after the end of each line. On MS-DOS and
MS-Windows, this option causes @code{cat} to read and write files in
text mode.

@item -n
@itemx --number
@opindex -n
@opindex --number
Number all output lines, starting with 1. On MS-DOS and MS-Windows,
this option causes @code{cat} to read and write files in text mode.

@item -s
@itemx --squeeze-blank
@opindex -s
@opindex --squeeze-blank
@cindex squeezing blank lines
Replace multiple adjacent blank lines with a single blank line. On
MS-DOS and MS-Windows, this option causes @code{cat} to read and write
files in text mode.

@item -t
@opindex -t
Equivalent to @samp{-vT}.

@item -T
@itemx --show-tabs
@opindex -T
@opindex --show-tabs
Display @key{TAB} characters as @samp{^I}.

@item -u
@opindex -u
Ignored; for Unix compatibility.

@item -v
@itemx --show-nonprinting
@opindex -v
@opindex --show-nonprinting
Display control characters except for @key{LFD} and @key{TAB} using
@samp{^} notation and precede characters that have the high bit set with
@samp{M-}. On MS-DOS and MS-Windows, this option causes @code{cat} to
read files and standard input in DOS binary mode, so the @key{CR}
characters at the end of each line are also visible.

@end table
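
The numbering and display options above can be tried on a few lines of
generated input. This is a minimal sketch that assumes a GNU @code{cat}
on a Unix-like system; the sample text and variable names are made up
for illustration.

```shell
# Number every line (-n), then only the nonblank ones (-b).
all=$(printf 'x\n\ny\n' | cat -n)
nonblank=$(printf 'x\n\ny\n' | cat -b)
printf '%s\n' "$all" "$nonblank"

# Squeeze the three blank lines into one (-s), show the tab as ^I (-T),
# and mark the end of each line with $ (-E).
marked=$(printf 'a\tb\n\n\n\nc\n' | cat -sET)
printf '%s\n' "$marked"
```

With these inputs the last command prints three lines: @samp{a^Ib$},
@samp{$}, and @samp{c$}.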


@node tac invocation
@section @code{tac}: Concatenate and write files in reverse

@pindex tac
@cindex reversing files

@code{tac} copies each @var{file} (@samp{-} means standard input), or
standard input if none are given, to standard output, reversing the
records (lines by default) in each separately. Synopsis:

@example
tac [@var{option}]@dots{} [@var{file}]@dots{}
@end example

@dfn{Records} are separated by instances of a string (newline by
default). By default, this separator string is attached to the end of
the record that it follows in the file.

The program accepts the following options. Also see @ref{Common options}.

@table @samp

@item -b
@itemx --before
@opindex -b
@opindex --before
The separator is attached to the beginning of the record that it
precedes in the file.

@item -r
@itemx --regex
@opindex -r
@opindex --regex
Treat the separator string as a regular expression. Users of @code{tac}
on MS-DOS/MS-Windows should note that, since @code{tac} reads files in
binary mode, each line of a text file might end with a CR/LF pair
instead of the Unix-style LF.

@item -s @var{separator}
@itemx --separator=@var{separator}
@opindex -s
@opindex --separator
Use @var{separator} as the record separator, instead of newline.

@end table
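
The effect of the record separator can be seen in a short sketch. It
assumes a GNU @code{tac}; the input strings and variable names are
illustrative only.

```shell
# Default: records are lines, so the lines come out in reverse order.
reversed=$(printf 'one\ntwo\nthree\n' | tac)
printf '%s\n' "$reversed"

# With -s, each record ends at a comma instead of at a newline, and the
# separator stays attached to the end of the record it follows.
by_comma=$(printf 'a,b,c,' | tac -s ,)
printf '%s\n' "$by_comma"
```

The first command prints @samp{three}, @samp{two}, and @samp{one} on
separate lines; the second prints @samp{c,b,a,}.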


@node nl invocation
@section @code{nl}: Number lines and write files

@pindex nl
@cindex numbering lines
@cindex line numbering

@code{nl} writes each @var{file} (@samp{-} means standard input), or
standard input if none are given, to standard output, with line numbers
added to some or all of the lines. Synopsis:

@example
nl [@var{option}]@dots{} [@var{file}]@dots{}
@end example

@cindex logical pages, numbering on
@code{nl} decomposes its input into (logical) pages; by default, the
line number is reset to 1 at the top of each logical page. @code{nl}
treats all of the input files as a single document; it does not reset
line numbers or logical pages between files.

@cindex headers, numbering
@cindex body, numbering
@cindex footers, numbering
A logical page consists of three sections: header, body, and footer.
Any of the sections can be empty. Each can be numbered in a different
style from the others.

The beginnings of the sections of logical pages are indicated in the
input file by a line containing exactly one of these delimiter strings:

@table @samp
@item \:\:\:
start of header;
@item \:\:
start of body;
@item \:
start of footer.
@end table

The two characters from which these strings are made can be changed from
@samp{\} and @samp{:} via options (see below), but the pattern and
length of each string cannot be changed.

A section delimiter is replaced by an empty line on output. Any text
that comes before the first section delimiter string in the input file
is considered to be part of a body section, so @code{nl} treats a
file that contains no section delimiters as a single body section.

The program accepts the following options. Also see @ref{Common options}.

@table @samp

@item -b @var{style}
@itemx --body-numbering=@var{style}
@opindex -b
@opindex --body-numbering
Select the numbering style for lines in the body section of each
logical page. When a line is not numbered, the current line number
is not incremented, but the line number separator character is still
prepended to the line. The styles are:

@table @samp
@item a
number all lines,
@item t
number only nonempty lines (default for body),
@item n
do not number lines (default for header and footer),
@item p@var{regexp}
number only lines that contain a match for @var{regexp}.
@end table

@item -d @var{cd}
@itemx --section-delimiter=@var{cd}
@opindex -d
@opindex --section-delimiter
@cindex section delimiters of pages
Set the section delimiter characters to @var{cd}; default is
@samp{\:}. If only @var{c} is given, the second remains @samp{:}.
(Remember to protect @samp{\} or other metacharacters from shell
expansion with quotes or extra backslashes.)

@item -f @var{style}
@itemx --footer-numbering=@var{style}
@opindex -f
@opindex --footer-numbering
Analogous to @samp{--body-numbering}.

@item -h @var{style}
@itemx --header-numbering=@var{style}
@opindex -h
@opindex --header-numbering
Analogous to @samp{--body-numbering}.

@item -i @var{number}
@itemx --page-increment=@var{number}
@opindex -i
@opindex --page-increment
Increment line numbers by @var{number} (default 1).

@item -l @var{number}
@itemx --join-blank-lines=@var{number}
@opindex -l
@opindex --join-blank-lines
@cindex empty lines, numbering
@cindex blank lines, numbering
Consider @var{number} (default 1) consecutive empty lines to be one
logical line for numbering, and number only the last one. Where fewer
than @var{number} consecutive empty lines occur, do not number them.
An empty line is one that contains no characters, not even spaces
or tabs.

@item -n @var{format}
@itemx --number-format=@var{format}
@opindex -n
@opindex --number-format
Select the line numbering format (default is @code{rn}):

@table @samp
@item ln
@opindex ln @r{format for @code{nl}}
left justified, no leading zeros;
@item rn
@opindex rn @r{format for @code{nl}}
right justified, no leading zeros;
@item rz
@opindex rz @r{format for @code{nl}}
right justified, leading zeros.
@end table

@item -p
@itemx --no-renumber
@opindex -p
@opindex --no-renumber
Do not reset the line number at the start of a logical page.

@item -s @var{string}
@itemx --number-separator=@var{string}
@opindex -s
@opindex --number-separator
Separate the line number from the text line in the output with
@var{string} (default is @key{TAB}).

@item -v @var{number}
@itemx --starting-line-number=@var{number}
@opindex -v
@opindex --starting-line-number
Set the initial line number on each logical page to @var{number} (default 1).

@item -w @var{number}
@itemx --number-width=@var{number}
@opindex -w
@opindex --number-width
Use @var{number} characters for line numbers (default 6).

@end table
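
The numbering style, format, width, separator, start, and increment
options combine freely. The sketch below assumes a GNU @code{nl} and
uses input without section delimiters, so everything is one body
section; the sample text and variable names are illustrative.

```shell
# Number all lines (-b a), two characters wide (-w 2), with ': ' as
# the separator between the number and the text (-s).
all=$(printf 'alpha\n\nbeta\n' | nl -b a -w 2 -s ': ')
printf '%s\n' "$all"

# Right justified with leading zeros (-n rz), three characters wide.
padded=$(printf 'x\ny\n' | nl -n rz -w 3 -s ' ')
printf '%s\n' "$padded"

# Start at 10 (-v) and step by 5 (-i).
stepped=$(printf 'x\ny\n' | nl -v 10 -i 5 -w 2 -s ' ')
printf '%s\n' "$stepped"
```

The second command prints @samp{001 x} and @samp{002 y}; the third
prints @samp{10 x} and @samp{15 y}.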


@node od invocation
@section @code{od}: Write files in octal or other formats

@pindex od
@cindex octal dump of files
@cindex hex dump of files
@cindex ASCII dump of files
@cindex file contents, dumping unambiguously

@code{od} writes an unambiguous representation of each @var{file}
(@samp{-} means standard input), or standard input if none are given.
Synopsis:

@example
od [@var{option}]@dots{} [@var{file}]@dots{}
od -C [@var{file}] [[+]@var{offset} [[+]@var{label}]]
@end example

Each line of output consists of the offset in the input, followed by
groups of data from the file. By default, @code{od} prints the offset in
octal, and each group of file data is two bytes of input printed as a
single octal number.

The program accepts the following options. Also see @ref{Common options}.

@table @samp

@item -A @var{radix}
@itemx --address-radix=@var{radix}
@opindex -A
@opindex --address-radix
@cindex radix for file offsets
@cindex file offset radix
Select the base in which file offsets are printed. @var{radix} can
be one of the following:

@table @samp
@item d
decimal;
@item o
octal;
@item x
hexadecimal;
@item n
none (do not print offsets).
@end table

The default is octal.

@item -j @var{bytes}
@itemx --skip-bytes=@var{bytes}
@opindex -j
@opindex --skip-bytes
Skip @var{bytes} input bytes before formatting and writing. If
@var{bytes} begins with @samp{0x} or @samp{0X}, it is interpreted in
hexadecimal; otherwise, if it begins with @samp{0}, in octal; otherwise,
in decimal. Appending @samp{b} multiplies @var{bytes} by 512, @samp{k}
by 1024, and @samp{m} by 1048576.

@item -N @var{bytes}
@itemx --read-bytes=@var{bytes}
@opindex -N
@opindex --read-bytes
Output at most @var{bytes} bytes of the input. Prefixes and suffixes on
@var{bytes} are interpreted as for the @samp{-j} option.

@item -s [@var{n}]
@itemx --strings[=@var{n}]
@opindex -s
@opindex --strings
@cindex string constants, outputting
Instead of the normal output, output only @dfn{string constants}: at
least @var{n} (3 by default) consecutive ASCII graphic characters,
followed by a null (zero) byte.

@item -t @var{type}
@itemx --format=@var{type}
@opindex -t
@opindex --format
Select the format in which to output the file data. @var{type} is a
string of one or more of the below type indicator characters. If you
include more than one type indicator character in a single @var{type}
string, or use this option more than once, @code{od} writes one copy
of each output line using each of the data types that you specified,
in the order that you specified.

Adding a trailing ``z'' to any type specification appends a display
of the ASCII character representation of the printable characters
to the output line generated by the type specification.

@table @samp
@item a
named character,
@item c
ASCII character or backslash escape,
@item d
signed decimal,
@item f
floating point,
@item o
octal,
@item u
unsigned decimal,
@item x
hexadecimal.
@end table

The type @code{a} outputs things like @samp{sp} for space, @samp{nl} for
newline, and @samp{nul} for a null (zero) byte. Type @code{c} outputs
@samp{ }, @samp{\n}, and @samp{\0}, respectively.

@cindex type size
Except for types @samp{a} and @samp{c}, you can specify the number
of bytes to use in interpreting each number in the given data type
by following the type indicator character with a decimal integer.
Alternatively, you can specify the size of one of the C compiler's
built-in data types by following the type indicator character with
one of the following characters. For integers (@samp{d}, @samp{o},
@samp{u}, @samp{x}):

@table @samp
@item C
char,
@item S
short,
@item I
int,
@item L
long.
@end table

For floating point (@code{f}):

@table @asis
@item F
float,
@item D
double,
@item L
long double.
@end table

@item -v
@itemx --output-duplicates
@opindex -v
@opindex --output-duplicates
Output consecutive lines that are identical. By default, when two or
more consecutive output lines would be identical, @code{od} outputs only
the first line, and puts just an asterisk on the following line to
indicate the elision.

@item -w[@var{n}]
@itemx --width[=@var{n}]
@opindex -w
@opindex --width
Dump @var{n} input bytes per output line. This must be a multiple of
the least common multiple of the sizes associated with the specified
output types. If @var{n} is omitted, the default is 32. If this option
is not given at all, the default is 16.

@end table
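
A few short dumps show how the address radix, the type string, and the
skip and length options interact. This is a sketch that assumes a GNU
@code{od} on a system using ASCII; the inputs and variable names are
illustrative.

```shell
# Hexadecimal bytes (-t x1) with no offset column (-A n).
hex=$(printf 'AB\n' | od -A n -t x1)
printf '%s\n' "$hex"

# Named characters (-t a): a space and a newline.
named=$(printf ' \n' | od -A n -t a)
printf '%s\n' "$named"

# Skip two bytes (-j 2), then dump at most three (-N 3) as characters.
slice=$(printf 'abcdef' | od -A n -t c -j 2 -N 3)
printf '%s\n' "$slice"

# Decimal offsets (-A d); the final line is the offset of end of input.
dec_offsets=$(printf 'abc' | od -A d -t x1)
printf '%s\n' "$dec_offsets"
```

The first three commands print @samp{ 41 42 0a}, @samp{  sp  nl}, and
@samp{   c   d   e}. The last prints @samp{0000000 61 62 63} followed by
@samp{0000003}.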

The next several options map the old, pre-@sc{POSIX} format specification
options to the corresponding @sc{POSIX} format specs. GNU @code{od} accepts
any combination of old- and new-style options. Format specification
options accumulate.

@table @samp

@item -a
@opindex -a
Output as named characters. Equivalent to @samp{-ta}.

@item -b
@opindex -b
Output as octal bytes. Equivalent to @samp{-toC}.

@item -c
@opindex -c
Output as ASCII characters or backslash escapes. Equivalent to
@samp{-tc}.

@item -d
@opindex -d
Output as unsigned decimal shorts. Equivalent to @samp{-tu2}.

@item -f
@opindex -f
Output as floats. Equivalent to @samp{-tfF}.

@item -h
@opindex -h
Output as hexadecimal shorts. Equivalent to @samp{-tx2}.

@item -i
@opindex -i
Output as decimal shorts. Equivalent to @samp{-td2}.

@item -l
@opindex -l
Output as decimal longs. Equivalent to @samp{-td4}.

@item -o
@opindex -o
Output as octal shorts. Equivalent to @samp{-to2}.

@item -x
@opindex -x
Output as hexadecimal shorts. Equivalent to @samp{-tx2}.

@item -C
@itemx --traditional
@opindex --traditional
Recognize the pre-@sc{POSIX} non-option arguments that traditional @code{od}
accepted. The following syntax:

@example
od --traditional [@var{file}] [[+]@var{offset}[.][b] [[+]@var{label}[.][b]]]
@end example

@noindent
can be used to specify at most one file and optional arguments
specifying an offset and a pseudo-start address, @var{label}. By
default, @var{offset} is interpreted as an octal number specifying how
many input bytes to skip before formatting and writing. The optional
trailing decimal point forces the interpretation of @var{offset} as a
decimal number. If no decimal point is specified and the offset begins with
@samp{0x} or @samp{0X} it is interpreted as a hexadecimal number. If
there is a trailing @samp{b}, the number of bytes skipped will be
@var{offset} multiplied by 512. The @var{label} argument is interpreted
just like @var{offset}, but it specifies an initial pseudo-address. The
pseudo-addresses are displayed in parentheses following any normal
address.

@end table
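
One way to check the old-to-new mapping is to run both spellings on the
same input and compare the results. The sketch below does this for a
subset of the options above; it assumes a GNU @code{od}, and the sample
string and variable names are illustrative. Options whose equivalents
depend on the size of a C type on the host are left out.

```shell
sample='GNU text utilities'
mismatches=0

# Each pair is old-style option, colon, new-style equivalent.
for pair in '-a:-ta' '-b:-toC' '-c:-tc' '-d:-tu2' '-o:-to2' '-x:-tx2'; do
  old=${pair%%:*}
  new=${pair##*:}
  if [ "$(printf '%s' "$sample" | od "$old")" = \
       "$(printf '%s' "$sample" | od "$new")" ]; then
    echo "$old matches $new"
  else
    echo "$old differs from $new"
    mismatches=$((mismatches + 1))
  fi
done

echo "mismatches: $mismatches"
```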


@node Formatting file contents
@chapter Formatting file contents

@cindex formatting file contents

These commands reformat the contents of files.

@menu
* fmt invocation:: Reformat paragraph text.
* pr invocation:: Paginate or columnate files for printing.
* fold invocation:: Wrap input lines to fit in specified width.
@end menu


@node fmt invocation
@section @code{fmt}: Reformat paragraph text

@pindex fmt
@cindex reformatting paragraph text
@cindex paragraphs, reformatting
@cindex text, reformatting

@code{fmt} fills and joins lines to produce output lines of (at most)
a given number of characters (75 by default). Synopsis:

@example
fmt [@var{option}]@dots{} [@var{file}]@dots{}
@end example

@code{fmt} reads from the specified @var{file} arguments (or standard
input if none are given), and writes to standard output.

By default, blank lines, spaces between words, and indentation are
preserved in the output; successive input lines with different
indentation are not joined; tabs are expanded on input and introduced on
output.

@cindex line-breaking
@cindex sentences and line-breaking
@cindex Knuth, Donald E.
@cindex Plass, Michael F.
@code{fmt} prefers breaking lines at the end of a sentence, and tries to
avoid line breaks after the first word of a sentence or before the last
word of a sentence. A @dfn{sentence break} is defined as either the end
of a paragraph or a word ending in any of @samp{.?!}, followed by two
spaces or end of line, ignoring any intervening parentheses or quotes.
Like @TeX{}, @code{fmt} reads entire ``paragraphs'' before choosing line
breaks; the algorithm is a variant of that in ``Breaking Paragraphs Into
Lines'' (Donald E. Knuth and Michael F. Plass, @cite{Software---Practice
and Experience}, 11 (1981), 1119--1184).

The program accepts the following options. Also see @ref{Common options}.

@table @samp

@item -c
@itemx --crown-margin
@opindex -c
@opindex --crown-margin
@cindex crown margin
@dfn{Crown margin} mode: preserve the indentation of the first two
lines within a paragraph, and align the left margin of each subsequent
line with that of the second line.

@item -t
@itemx --tagged-paragraph
@opindex -t
@opindex --tagged-paragraph
@cindex tagged paragraphs
@dfn{Tagged paragraph} mode: like crown margin mode, except that if
indentation of the first line of a paragraph is the same as the
indentation of the second, the first line is treated as a one-line
paragraph.

@item -s
@itemx --split-only
@opindex -s
@opindex --split-only
Split lines only. Do not join short lines to form longer ones. This
prevents sample lines of code and other such ``formatted'' text from
being unduly combined.

@item -u
@itemx --uniform-spacing
@opindex -u
@opindex --uniform-spacing
Uniform spacing. Reduce spacing between words to one space, and spacing
between sentences to two spaces.

@item -@var{width}
@itemx -w @var{width}
@itemx --width=@var{width}
@opindex -@var{width}
@opindex -w
@opindex --width
Fill output lines up to @var{width} characters (default 75). @code{fmt}
initially tries to make lines about 7% shorter than this, to give it
room to balance line lengths.

@item -p @var{prefix}
@itemx --prefix=@var{prefix}
@opindex -p
@opindex --prefix
Only lines beginning with @var{prefix} (possibly preceded by whitespace)
are subject to formatting. The prefix and any preceding whitespace are
stripped for the formatting and then re-attached to each formatted output
line. One use is to format certain kinds of program comments, while
leaving the code unchanged.

@end table
|
|
|
|
|
|
@node pr invocation
|
|
@section @code{pr}: Paginate or columnate files for printing
|
|
|
|
@pindex pr
|
|
@cindex printing, preparing files for
|
|
@cindex multicolumn output, generating
|
|
@cindex merging files in parallel
|
|
|
|
@code{pr} writes each @var{file} (@samp{-} means standard input), or
|
|
standard input if none are given, to standard output, paginating and
|
|
optionally outputting in multicolumn format; optionally merges all
|
|
@var{file}s, printing all in parallel, one per column. Synopsis:
|
|
|
|
@example
|
|
pr [@var{option}]@dots{} [@var{file}]@dots{}
|
|
@end example
|
|
|
|
By default, a 5-line header is printed: two blank lines; a line with the
|
|
date, the file name, and the page count; and two more blank lines. A
|
|
footer of five blank lines is also printed. With the @samp{-f} option, a
|
|
3-line header is printed: the leading two blank lines are omitted, and no
footer is used. The default @var{page_length} in both cases is 66 lines.
|
|
The text line of the header takes up the full @var{page_width} in the
|
|
form @samp{yy-mm-dd HH:MM string Page nnnn}. String is a centered
|
|
string.
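
The page arithmetic described above can be checked with a small Python sketch. It assumes nothing beyond the figures given in this section, plus the rule from the @samp{-l} option that very short pages drop their header and footer. The function name @code{text_lines_per_page} is hypothetical.

```python
def text_lines_per_page(page_length=66, form_feed=False):
    """Return how many lines of text fit on one page.

    Without -f: 5 header lines and 5 footer lines.
    With -f:    3 header lines and no footer.
    """
    header = 3 if form_feed else 5
    footer = 0 if form_feed else 5
    # Short pages carry no header or footer at all (see the -l option).
    shortest_page_with_header = 3 if form_feed else 10
    if page_length <= shortest_page_with_header:
        return page_length
    return page_length - header - footer


print(text_lines_per_page())
print(text_lines_per_page(form_feed=True))
```

With the default page length this gives 56 lines of text per page, and 63 lines when @samp{-f} is in effect.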
|
|
|
|
Form feeds in the input cause page breaks in the output. Multiple form
|
|
feeds produce empty pages.
|
|
|
|
Columns have equal width, separated by an optional string (default
|
|
space). Lines will always be truncated to line width (default 72),
|
|
unless you use the @samp{-j} option. For single column output no line
|
|
truncation occurs by default. Use @samp{-w} option to truncate lines
|
|
in that case.
|
|
|
|
The program accepts the following options. Also see @ref{Common options}.
|
|
|
|
@table @samp
|
|
|
|
@item +@var{first_page}[:@var{last_page}]
|
|
@itemx --pages=@var{first_page}[:@var{last_page}]
|
|
@opindex +@var{first_page}[:@var{last_page}]
|
|
@opindex --pages
|
|
Begin printing with page @var{first_page} and stop with
|
|
@var{last_page}. A missing @samp{:@var{last_page}} implies end of file. While
estimating the number of skipped pages, each form feed in the input file
results in a new page. Page counting with and without
@samp{+@var{first_page}} is identical. By default, counting starts with the
first page of the input file (not the first page printed). Line numbering may be
altered by the @samp{-N} option.
|
|
|
|
@item -@var{column}
|
|
@itemx --columns=@var{column}
|
|
@opindex -@var{column}
|
|
@opindex --columns
|
|
@cindex down columns
|
|
With each single @var{file}, produce @var{column}-column output and
|
|
print columns down. The column width is automatically estimated from
|
|
@var{page_width}. This option might well cause some columns to be
|
|
truncated. The number of lines in the columns on each page will be
|
|
balanced. @samp{-@var{column}} may not be used with @samp{-m} option.
|
|
|
|
@item -a
|
|
@itemx --across
|
|
@opindex -a
|
|
@opindex --across
|
|
@cindex across columns
|
|
With each single @var{file}, print columns across rather than down.
|
|
Use this together with @samp{-@var{column}}; @var{column} must be greater than one.
|
|
|
|
@item -c
|
|
@itemx --show-control-chars
|
|
@opindex -c
|
|
@opindex --show-control-chars
|
|
Print control characters using hat notation (e.g., @samp{^G}); print
|
|
other unprintable characters in octal backslash notation. By default,
|
|
unprintable characters are not changed.
|
|
|
|
@item -d
|
|
@itemx --double-space
|
|
@opindex -d
|
|
@opindex --double-space
|
|
@cindex double spacing
|
|
Double space the output.
|
|
|
|
@item -e[@var{in-tabchar}[@var{in-tabwidth}]]
|
|
@itemx --expand-tabs[=@var{in-tabchar}[@var{in-tabwidth}]]
|
|
@opindex -e
|
|
@opindex --expand-tabs
|
|
@cindex input tabs
|
|
Expand tabs to spaces on input. Optional argument @var{in-tabchar} is
|
|
the input tab character (default is @key{TAB}). Second optional
|
|
argument @var{in-tabwidth} is the input tab character's width (default
|
|
is 8).
|
|
|
|
@item -f
|
|
@itemx -F
|
|
@itemx --form-feed
|
|
@opindex -F
|
|
@opindex -f
|
|
@opindex --form-feed
|
|
Use a form feed instead of newlines to separate output pages. Default
|
|
page length of 66 lines is not altered. But the number of lines of text
|
|
per page changes from 56 to 63 lines.
|
|
|
|
|
|
@item -h @var{HEADER}
|
|
@itemx --header=@var{HEADER}
|
|
@opindex -h
|
|
@opindex --header
|
|
Replace the file name in the header with the centered string
|
|
@var{header}. Left-hand-side truncation (marked by a @samp{*}) may occur
|
|
if the total header line @samp{yy-mm-dd HH:MM HEADER Page nnnn}
|
|
becomes larger than @var{page_width}. @samp{-h ""} prints a blank line
|
|
header. Do not use @samp{-h""}: a space between the @samp{-h} option and its
argument is always required.
|
|
|
|
@item -i[@var{out-tabchar}[@var{out-tabwidth}]]
|
|
@itemx --output-tabs[=@var{out-tabchar}[@var{out-tabwidth}]]
|
|
@opindex -i
|
|
@opindex --output-tabs
|
|
@cindex output tabs
|
|
Replace spaces with tabs on output. Optional argument @var{out-tabchar}
|
|
is the output tab character (default is @key{TAB}). Second optional
|
|
argument @var{out-tabwidth} is the output tab character's width (default
|
|
is 8).
|
|
|
|
@item -j
|
|
@itemx --join-lines
|
|
@opindex -j
|
|
@opindex --join-lines
|
|
Merge lines of full length. Used together with the column options
|
|
@samp{-@var{column}}, @samp{-a -@var{column}} or @samp{-m}. Turns off
|
|
@samp{-w} line truncation; no column alignment used; may be used with
|
|
@samp{-s[@var{separator}]}.
|
|
|
|
|
|
@item -l @var{page_length}
|
|
@itemx --length=@var{page_length}
|
|
@opindex -l
|
|
@opindex --length
|
|
Set the page length to @var{page_length} (default 66) lines. If
|
|
@var{page_length} is less than or equal to 10 (or to 3 with @samp{-f}),
|
|
the headers and footers are omitted, and all form feeds set in input
|
|
files are eliminated, as if the @samp{-T} option had been given.
|
|
|
|
@item -m
|
|
@itemx --merge
|
|
@opindex -m
|
|
@opindex --merge
|
|
Merge and print all @var{file}s in parallel, one in each column. If a
|
|
line is too long to fit in a column, it is truncated (but see
|
|
@samp{-j}). @samp{-s[@var{separator}]} may be used. Empty pages in some
|
|
@var{file}s (form feeds set) produce empty columns, still marked by
|
|
@var{separator}. Completely empty common pages show no separators or
|
|
line numbers. The default header becomes
|
|
@samp{yy-mm-dd HH:MM <blanks> Page nnnn}; may be used with
|
|
@samp{-h @var{header}} to fill up the middle part.
|
|
|
|
|
|
@item -n[@var{number-separator}[@var{digits}]]
|
|
@itemx --number-lines[=@var{number-separator}[@var{digits}]]
|
|
@opindex -n
|
|
@opindex --number-lines
|
|
Precede each column with a line number; with parallel @var{file}s
|
|
(@samp{-m}), precede each line, rather than each column, with a line number. Optional argument
|
|
@var{number-separator} is the character to print after each number
|
|
(default is @key{TAB}). Optional argument @var{digits} is the number of
|
|
digits per line number (default is 5). By default, line counting starts with the
first line of the input file (not with the first line printed; see
|
|
@samp{-N}).
|
|
|
|
@item -N @var{line_number}
|
|
@itemx --first-line-number=@var{line_number}
|
|
@opindex -N
|
|
@opindex --first-line-number
|
|
Start line counting with number @var{line_number} at the first line of the first
|
|
page printed.
|
|
|
|
@item -o @var{n}
|
|
@itemx --indent=@var{n}
|
|
@opindex -o
|
|
@opindex --indent
|
|
@cindex indenting lines
|
|
@cindex left margin
|
|
Indent each line with a margin @var{n} spaces wide (default is zero), i.e., set
|
|
the left margin. The total page width is @var{n} plus the width set
|
|
with the @samp{-w} option.
|
|
|
|
@item -r
|
|
@itemx --no-file-warnings
|
|
@opindex -r
|
|
@opindex --no-file-warnings
|
|
Do not print a warning message when an argument @var{file} cannot be
|
|
opened. (The exit status will still be nonzero, however.)
|
|
|
|
@item -s[@var{separator}]
|
|
@itemx --separator[=@var{separator}]
|
|
@opindex -s
|
|
@opindex --separator
|
|
Separate columns by a string @var{separator}. Do not write
@samp{-s @var{separator}}: no space is allowed between the flag and its
argument. If this option is omitted altogether, the default separator is
@key{TAB} when the @samp{-j} option is given and a space otherwise (same as
@samp{-s" "}). With
|
|
@samp{-s} only, no separator is used (same as @samp{-s""}). @samp{-s}
|
|
does not affect line truncation or column alignment.
|
|
|
|
@item -t
|
|
@itemx --omit-header
|
|
@opindex -t
|
|
@opindex --omit-header
|
|
Do not print the usual header [and footer] on each page, and do not fill
|
|
out the bottoms of pages (with blank lines or a form feed). No page
|
|
structure is produced, but form feeds set in the input files are retained. The
predefined page layout is not changed. @samp{-t} or @samp{-T} may be
useful together with other options; e.g., @samp{-t -e4} expands each
@key{TAB} in the input file to 4 spaces but makes no other changes.
|
|
Use of @samp{-t} overrides @samp{-h}.
|
|
|
|
@item -T
|
|
@itemx --omit-pagination
|
|
@opindex -T
|
|
@opindex --omit-pagination
|
|
Do not print header [and footer]. In addition, eliminate all form feeds
|
|
set in the input files.
|
|
|
|
@item -v
|
|
@itemx --show-nonprinting
|
|
@opindex -v
|
|
@opindex --show-nonprinting
|
|
Print unprintable characters in octal backslash notation.
|
|
|
|
@item -w @var{page_width}
|
|
@itemx --width=@var{page_width}
|
|
@opindex -w
|
|
@opindex --width
|
|
Set the page width to @var{page_width} (default 72) characters.
|
|
With or without @samp{-w}, header lines are always truncated to
|
|
@var{page_width} characters. With @samp{-w}, text lines are truncated,
|
|
unless @samp{-j} is used. Without @samp{-w} together with one of the
|
|
column options @samp{-@var{column}}, @samp{-a -@var{column}} or
|
|
@samp{-m}, default truncation of text lines to 72 characters is used.
|
|
Without @samp{-w} and without any of the column options, no line
|
|
truncation is used. That's equivalent to @samp{-w 72 -j}.
|
|
|
|
@end table
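
The interaction between @samp{-w}, the column options, and @samp{-j} described above can be summarized as a small decision function. This Python sketch only restates those rules. The function and parameter names are hypothetical, and a result of @code{None} stands for ``text lines are not truncated''.

```python
def truncation_width(page_width=None, column_option=False, join_lines=False):
    """Return the width text lines are truncated to, or None for no truncation.

    page_width    -- argument of -w, or None when -w is not given
    column_option -- True when -COLUMN, -a -COLUMN or -m is given
    join_lines    -- True when -j is given
    """
    if join_lines:
        return None              # -j turns off line truncation
    if page_width is not None:
        return page_width        # -w given: truncate to that width
    if column_option:
        return 72                # column output without -w: default width
    return None                  # single column, no -w: no truncation


print(truncation_width(column_option=True))
print(truncation_width(page_width=100))
print(truncation_width())
```

Header lines are not covered by this sketch, since they are always truncated to the page width.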
|
|
|
|
|
|
@node fold invocation
|
|
@section @code{fold}: Wrap input lines to fit in specified width
|
|
|
|
@pindex fold
|
|
@cindex wrapping long input lines
|
|
@cindex folding long input lines
|
|
|
|
@code{fold} writes each @var{file} (@samp{-} means standard input), or
|
|
standard input if none are given, to standard output, breaking long
|
|
lines. Synopsis:
|
|
|
|
@example
|
|
fold [@var{option}]@dots{} [@var{file}]@dots{}
|
|
@end example
|
|
|
|
By default, @code{fold} breaks lines wider than 80 columns. The output
|
|
is split into as many lines as necessary.
|
|
|
|
@cindex screen columns
|
|
@code{fold} counts screen columns by default; thus, a tab may count more
|
|
than one column, backspace decreases the column count, and carriage
|
|
return sets the column to zero.
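
As an illustration of this column counting, the following Python sketch tracks the screen column after a string has been written. It is not the code @code{fold} uses. The name @code{screen_column} is made up, and the tab stop of 8 columns is an assumption of the example, since the text above says only that a tab may count for more than one column.

```python
def screen_column(text, tab_stop=8):
    """Return the screen column reached after writing TEXT from column 0."""
    column = 0
    for ch in text:
        if ch == "\t":
            # Advance to the next tab stop (assumed every TAB_STOP columns).
            column += tab_stop - column % tab_stop
        elif ch == "\b":
            # Backspace decreases the column count, but not below zero.
            column = max(column - 1, 0)
        elif ch == "\r":
            # Carriage return sets the column to zero.
            column = 0
        else:
            column += 1
    return column


print(screen_column("ab\tc"))
print(screen_column("abc\b"))
print(screen_column("abc\rx"))
```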
|
|
|
|
The program accepts the following options. Also see @ref{Common options}.
|
|
|
|
@table @samp
|
|
|
|
@item -b
|
|
@itemx --bytes
|
|
@opindex -b
|
|
@opindex --bytes
|
|
Count bytes rather than columns, so that tabs, backspaces, and carriage
|
|
returns are each counted as taking up one column, just like other
|
|
characters.
|
|
|
|
@item -s
|
|
@itemx --spaces
|
|
@opindex -s
|
|
@opindex --spaces
|
|
Break at word boundaries: the line is broken after the last blank before
|
|
the maximum line length. If the line contains no such blanks, the line
|
|
is broken at the maximum line length as usual.
|
|
|
|
@item -w @var{width}
|
|
@itemx --width=@var{width}
|
|
@opindex -w
|
|
@opindex --width
|
|
Use a maximum line length of @var{width} columns instead of 80.
|
|
|
|
@end table
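
The @samp{-s} rule can be sketched as follows. This Python example is a simplified model, not the actual implementation: it counts plain characters instead of screen columns, treats only the space character as a blank, and uses the hypothetical name @code{fold_spaces}.

```python
def fold_spaces(line, width=80):
    """Split LINE into pieces of at most WIDTH characters, as with -s."""
    pieces = []
    while len(line) > width:
        # Look for the last blank within the first WIDTH characters.
        cut = line.rfind(" ", 0, width)
        if cut >= 0:
            cut += 1         # break after that blank
        else:
            cut = width      # no blank: break at the maximum length
        pieces.append(line[:cut])
        line = line[cut:]
    pieces.append(line)
    return pieces


print(fold_spaces("aaa bbb ccc", width=8))
print(fold_spaces("abcdefghij", width=4))
```

Joining the returned pieces always gives back the original line, since the blank at each break stays at the end of the piece before it.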
|
|
|
|
|
|
@node Output of parts of files
|
|
@chapter Output of parts of files
|
|
|
|
@cindex output of parts of files
|
|
@cindex parts of files, output of
|
|
|
|
These commands output pieces of the input.
|
|
|
|
@menu
|
|
* head invocation:: Output the first part of files.
|
|
* tail invocation:: Output the last part of files.
|
|
* split invocation:: Split a file into fixed-size pieces.
|
|
* csplit invocation:: Split a file into context-determined pieces.
|
|
@end menu
|
|
|
|
@node head invocation
|
|
@section @code{head}: Output the first part of files
|
|
|
|
@pindex head
|
|
@cindex initial part of files, outputting
|
|
@cindex first part of files, outputting
|
|
|
|
@code{head} prints the first part (10 lines by default) of each
|
|
@var{file}; it reads from standard input if no files are given or
|
|
when given a @var{file} of @samp{-}. Synopses:
|
|
|
|
@example
|
|
head [@var{option}]@dots{} [@var{file}]@dots{}
|
|
head -@var{number} [@var{option}]@dots{} [@var{file}]@dots{}
|
|
@end example
|
|
|
|
If more than one @var{file} is specified, @code{head} prints a
|
|
one-line header consisting of
|
|
@example
|
|
==> @var{file name} <==
|
|
@end example
|
|
@noindent
|
|
before the output for each @var{file}.
|
|
|
|
@code{head} accepts two option formats: the new one, in which numbers
|
|
are arguments to the options (@samp{-q -n 1}), and the old one, in which
|
|
the number precedes any option letters (@samp{-1q}).
|
|
|
|
The program accepts the following options. Also see @ref{Common options}.
|
|
|
|
@table @samp
|
|
|
|
@item -@var{count}@var{options}
|
|
@opindex -@var{count}
|
|
This option is only recognized if it is specified first. @var{count} is
|
|
a decimal number optionally followed by a size letter (@samp{b},
|
|
@samp{k}, @samp{m}) as in @code{-c}, or @samp{l} to mean count by lines,
|
|
or other option letters (@samp{cqv}).
|
|
|
|
@item -c @var{bytes}
|
|
@itemx --bytes=@var{bytes}
|
|
@opindex -c
|
|
@opindex --bytes
|
|
Print the first @var{bytes} bytes, instead of initial lines. Appending
|
|
@samp{b} multiplies @var{bytes} by 512, @samp{k} by 1024, and @samp{m}
|
|
by 1048576.
|
|
|
|
@item -n @var{n}
|
|
@itemx --lines=@var{n}
|
|
@opindex -n
|
|
@opindex --lines
|
|
Output the first @var{n} lines.
|
|
|
|
@item -q
|
|
@itemx --quiet
|
|
@itemx --silent
|
|
@opindex -q
|
|
@opindex --quiet
|
|
@opindex --silent
|
|
Never print file name headers.
|
|
|
|
@item -v
|
|
@itemx --verbose
|
|
@opindex -v
|
|
@opindex --verbose
|
|
Always print file name headers.
|
|
|
|
@end table
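
The size letters accepted by @samp{-c} can be illustrated with a short Python sketch. The helper names @code{parse_byte_count} and @code{head_bytes} are hypothetical. The sketch handles only the three multipliers listed above.

```python
def parse_byte_count(text):
    """Convert a -c argument such as '10', '2b', '3k' or '1m' to bytes."""
    multipliers = (("b", 512), ("k", 1024), ("m", 1048576))
    for suffix, factor in multipliers:
        if text.endswith(suffix):
            return int(text[:-1]) * factor
    return int(text)


def head_bytes(data, count):
    """Return the first COUNT bytes of DATA, COUNT given as for -c."""
    return data[:parse_byte_count(count)]


print(parse_byte_count("2b"))
print(head_bytes(b"hello, world", "5"))
```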
|
|
|
|
|
|
@node tail invocation
|
|
@section @code{tail}: Output the last part of files
|
|
|
|
@pindex tail
|
|
@cindex last part of files, outputting
|
|
|
|
@code{tail} prints the last part (10 lines by default) of each
|
|
@var{file}; it reads from standard input if no files are given or
|
|
when given a @var{file} of @samp{-}. Synopses:
|
|
|
|
@example
|
|
tail [@var{option}]@dots{} [@var{file}]@dots{}
|
|
tail -@var{number} [@var{option}]@dots{} [@var{file}]@dots{}
|
|
tail +@var{number} [@var{option}]@dots{} [@var{file}]@dots{}
|
|
@end example
|
|
|
|
If more than one @var{file} is specified, @code{tail} prints a
|
|
one-line header consisting of
|
|
@example
|
|
==> @var{file name} <==
|
|
@end example
|
|
@noindent
|
|
before the output for each @var{file}.
|
|
|
|
@cindex BSD @code{tail}
|
|
GNU @code{tail} can output any amount of data (some other versions of
|
|
@code{tail} cannot). It also has no @samp{-r} option (print in
|
|
reverse), since reversing a file is really a different job from printing
|
|
the end of a file; BSD @code{tail} (which is the one with @code{-r}) can
|
|
only reverse files that are at most as large as its buffer, which is
|
|
typically 32k. A more reliable and versatile way to reverse files is
|
|
the GNU @code{tac} command.
|
|
|
|
@code{tail} accepts two option formats: the new one, in which numbers
|
|
are arguments to the options (@samp{-n 1}), and the old one, in which
|
|
the number precedes any option letters (@samp{-1} or @samp{+1}).
|
|
|
|
If any option-argument is a number @var{n} starting with a @samp{+},
|
|
@code{tail} begins printing with the @var{n}th item from the start of
|
|
each file, instead of from the end.
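
The difference between counting from the end and counting from the start can be sketched in Python. This is an illustration of the rule above for line counts only. The function name @code{tail_lines} is made up, and the count is passed as a string so that a leading @samp{+} can be seen.

```python
def tail_lines(lines, count="10"):
    """Select lines as tail would for the line count COUNT.

    '+N' starts with the Nth line from the start; 'N' or '-N'
    selects the last N lines.
    """
    if count.startswith("+"):
        start = max(int(count[1:]) - 1, 0)
        return lines[start:]
    n = int(count.lstrip("-"))
    return lines[-n:] if n > 0 else []


lines = ["a", "b", "c", "d", "e"]
print(tail_lines(lines, "2"))
print(tail_lines(lines, "+2"))
```

Here a count of @samp{2} selects the last two lines, while @samp{+2} selects everything from the second line onward.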
|
|
|
|
The program accepts the following options. Also see @ref{Common options}.
|
|
|
|
@table @samp
|
|
|
|
@item -@var{count}
|
|
@itemx +@var{count}
|
|
@opindex -@var{count}
|
|
@opindex +@var{count}
|
|
This option is only recognized if it is specified first. @var{count} is
|
|
a decimal number optionally followed by a size letter (@samp{b},
|
|
@samp{k}, @samp{m}) as in @code{-c}, or @samp{l} to mean count by lines,
|
|
or other option letters (@samp{cfqv}).
|
|
|
|
@item -c @var{bytes}
|
|
@itemx --bytes=@var{bytes}
|
|
@opindex -c
|
|
@opindex --bytes
|
|
Output the last @var{bytes} bytes, instead of final lines. Appending
|
|
@samp{b} multiplies @var{bytes} by 512, @samp{k} by 1024, and @samp{m}
|
|
by 1048576.
|
|
|
|
@item -f
|
|
@itemx --follow
|
|
@opindex -f
|
|
@opindex --follow
|
|
@cindex growing files
|
|
Loop forever trying to read more characters at the end of the file,
|
|
presumably because the file is growing. Ignored if reading from a pipe.
|
|
If more than one file is given, @code{tail} prints a header whenever it
|
|
gets output from a different file, to indicate which file that output is
|
|
from.
|
|
|
|
@item -n @var{n}
|
|
@itemx --lines=@var{n}
|
|
@opindex -n
|
|
@opindex --lines
|
|
Output the last @var{n} lines.
|
|
|
|
@item -q
|
|
@itemx --quiet
|
|
@itemx --silent
|
|
@opindex -q
|
|
@opindex --quiet
|
|
@opindex --silent
|
|
Never print file name headers.
|
|
|
|
@item -v
|
|
@itemx --verbose
|
|
@opindex -v
|
|
@opindex --verbose
|
|
Always print file name headers.
|
|
|
|
@end table
|
|
|
|
|
|
@node split invocation
|
|
@section @code{split}: Split a file into fixed-size pieces
|
|
|
|
@pindex split
|
|
@cindex splitting a file into pieces
|
|
@cindex pieces, splitting a file into
|
|
|
|
@code{split} creates output files containing consecutive sections of
|
|
@var{input} (standard input if none is given or @var{input} is
|
|
@samp{-}). Synopsis:
|
|
|
|
@example
|
|
split [@var{option}] [@var{input} [@var{prefix}]]
|
|
@end example
|
|
|
|
By default, @code{split} puts 1000 lines of @var{input} (or whatever is
|
|
left over for the last section), into each output file.
|
|
|
|
@cindex output file name prefix
|
|
The output files' names consist of @var{prefix} (@samp{x} by default)
|
|
followed by a group of letters @samp{aa}, @samp{ab}, and so on, such
|
|
that concatenating the output files in sorted order by file name produces
|
|
the original input file. (If more than 676 output files are required,
|
|
@code{split} uses @samp{zaa}, @samp{zab}, etc.)
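
The naming scheme for the first 676 output files can be sketched in Python. The function name @code{split_names} is hypothetical, and the sketch deliberately stops at 676 names rather than guessing at the details of the longer names used beyond that point.

```python
from itertools import product
from string import ascii_lowercase


def split_names(count, prefix="x"):
    """Return the first COUNT output file names: PREFIX + 'aa', 'ab', ..."""
    if count > 26 * 26:
        raise ValueError("this sketch covers only the first 676 names")
    suffixes = product(ascii_lowercase, repeat=2)
    return [prefix + a + b for (a, b), _ in zip(suffixes, range(count))]


print(split_names(3))
print(split_names(27)[-1])
```

Because the suffixes are generated in alphabetical order, sorting the names reproduces the order in which the pieces were written.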
|
|
|
|
The program accepts the following options. Also see @ref{Common options}.
|
|
|
|
@table @samp
|
|
|
|
@item -@var{lines}
|
|
@itemx -l @var{lines}
|
|
@itemx --lines=@var{lines}
|
|
@opindex -l
|
|
@opindex --lines
|
|
Put @var{lines} lines of @var{input} into each output file.
|
|
|
|
@item -b @var{bytes}
|
|
@itemx --bytes=@var{bytes}
|
|
@opindex -b
|
|
@opindex --bytes
|
|
Put @var{bytes} bytes of @var{input} into each output file.
|
|
Appending @samp{b} multiplies @var{bytes} by 512, @samp{k} by 1024, and
|
|
@samp{m} by 1048576.
|
|
|
|
@item -C @var{bytes}
|
|
@itemx --line-bytes=@var{bytes}
|
|
@opindex -C
|
|
@opindex --line-bytes
|
|
Put into each output file as many complete lines of @var{input} as
|
|
possible without exceeding @var{bytes} bytes. For lines longer than
|
|
@var{bytes} bytes, put @var{bytes} bytes into each output file until
|
|
fewer than @var{bytes} bytes of the line are left, then continue
|
|
normally. @var{bytes} has the same format as for the @samp{--bytes}
|
|
option.
|
|
|
|
@item --verbose
|
|
@opindex --verbose
|
|
Write a diagnostic to standard error just before each output file is opened.
|
|
|
|
@end table
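
The @samp{--line-bytes} rule is easier to follow with a worked sketch. The Python function below is one reading of the description above, not the actual implementation. The name @code{split_line_bytes} is hypothetical, and it returns the contents of the output files as a list instead of writing them.

```python
def split_line_bytes(data, limit):
    """Split DATA into chunks of at most LIMIT bytes, keeping lines whole
    where possible."""
    chunks = []
    current = b""
    for line in data.splitlines(keepends=True):
        # A line longer than LIMIT is cut into LIMIT-byte chunks first.
        while len(line) > limit:
            if current:
                chunks.append(current)
                current = b""
            chunks.append(line[:limit])
            line = line[limit:]
        # Start a new chunk when the (rest of the) line no longer fits.
        if len(current) + len(line) > limit:
            chunks.append(current)
            current = b""
        current += line
    if current:
        chunks.append(current)
    return chunks


print(split_line_bytes(b"aa\nbb\ncc\n", 6))
print(split_line_bytes(b"abcdefgh\nxy\n", 3))
```

Concatenating the chunks in order always gives back the original input.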
|
|
|
|
|
|
@node csplit invocation
|
|
@section @code{csplit}: Split a file into context-determined pieces
|
|
|
|
@pindex csplit
|
|
@cindex context splitting
|
|
@cindex splitting a file into pieces by context
|
|
|
|
@code{csplit} creates zero or more output files containing sections of
|
|
@var{input} (standard input if @var{input} is @samp{-}). Synopsis:
|
|
|
|
@example
|
|
csplit [@var{option}]@dots{} @var{input} @var{pattern}@dots{}
|
|
@end example
|
|
|
|
The contents of the output files are determined by the @var{pattern}
|
|
arguments, as detailed below. An error occurs if a @var{pattern}
|
|
argument refers to a nonexistent line of the input file (e.g., if no
|
|
remaining line matches a given regular expression). After every
|
|
@var{pattern} has been matched, any remaining input is copied into one
|
|
last output file.
|
|
|
|
By default, @code{csplit} prints the number of bytes written to each
|
|
output file after it has been created.
|
|
|
|
The types of pattern arguments are:
|
|
|
|
@table @samp
|
|
|
|
@item @var{n}
|
|
Create an output file containing the input up to but not including line
|
|
@var{n} (a positive integer). If followed by a repeat count, also
|
|
create an output file containing the next @var{n} lines of the input
|
|
file once for each repeat.
|
|
|
|
@item /@var{regexp}/[@var{offset}]
|
|
Create an output file containing the current line up to (but not
|
|
including) the next line of the input file that contains a match for
|
|
@var{regexp}. The optional @var{offset} is a @samp{+} or @samp{-}
|
|
followed by a positive integer. If it is given, the input up to the
|
|
matching line plus or minus @var{offset} is put into the output file,
|
|
and the line after that begins the next section of input.
|
|
|
|
@item %@var{regexp}%[@var{offset}]
|
|
Like the previous type, except that it does not create an output
|
|
file, so that section of the input file is effectively ignored.
|
|
|
|
@item @{@var{repeat-count}@}
|
|
Repeat the previous pattern @var{repeat-count} additional
|
|
times. @var{repeat-count} can either be a positive integer or an
|
|
asterisk, meaning repeat as many times as necessary until the input is
|
|
exhausted.
|
|
|
|
@end table
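
As an illustration, the effect of a @samp{/@var{regexp}/} pattern followed by a repeat count of @samp{*} can be sketched in Python. This is a simplified model with a hypothetical function name, it ignores offsets, and it uses Python regular expression syntax, which is an assumption of the example and may differ from the syntax @code{csplit} accepts.

```python
import re


def split_before_matches(lines, regexp):
    """Split LINES into sections, starting a new section at each match."""
    pattern = re.compile(regexp)
    sections = [[]]
    for line in lines:
        if pattern.search(line):
            # The matching line begins the next section.
            sections.append([])
        sections[-1].append(line)
    return sections


lines = ["# a", "x", "# b", "y", "z"]
print(split_before_matches(lines, "^#"))
```

When the first input line already matches, the first section is empty, which corresponds to a zero-length first output file.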
|
|
|
|
The output files' names consist of a prefix (@samp{xx} by default)
|
|
followed by a suffix. By default, the suffix is an ascending sequence
|
|
of two-digit decimal numbers from @samp{00} up to @samp{99}. In any
|
|
case, concatenating the output files in sorted order by filename
|
|
produces the original input file.
|
|
|
|
By default, if @code{csplit} encounters an error or receives a hangup,
|
|
interrupt, quit, or terminate signal, it removes any output files
|
|
that it has created so far before it exits.
|
|
|
|
The program accepts the following options. Also see @ref{Common options}.
|
|
|
|
@table @samp
|
|
|
|
@item -f @var{prefix}
|
|
@itemx --prefix=@var{prefix}
|
|
@opindex -f
|
|
@opindex --prefix
|
|
@cindex output file name prefix
|
|
Use @var{prefix} as the output file name prefix.
|
|
|
|
@item -b @var{suffix}
|
|
@itemx --suffix=@var{suffix}
|
|
@opindex -b
|
|
@opindex --suffix
|
|
@cindex output file name suffix
|
|
Use @var{suffix} as the output file name suffix. When this option is
|
|
specified, the suffix string must include exactly one
|
|
@code{printf(3)}-style conversion specification, possibly including
|
|
format specification flags, a field width, a precision specification,
|
|
or all of these kinds of modifiers. The format letter must convert a
|
|
binary integer argument to readable form; thus, only @samp{d}, @samp{i},
|
|
@samp{u}, @samp{o}, @samp{x}, and @samp{X} conversions are allowed. The
|
|
entire @var{suffix} is given (with the current output file number) to
|
|
@code{sprintf(3)} to form the file name suffixes for each of the
|
|
individual output files in turn. If this option is used, the
|
|
@samp{--digits} option is ignored.
|
|
|
|
@item -n @var{digits}
|
|
@itemx --digits=@var{digits}
|
|
@opindex -n
|
|
@opindex --digits
|
|
Use output file names containing numbers that are @var{digits} digits
|
|
long instead of the default 2.
|
|
|
|
@item -k
|
|
@itemx --keep-files
|
|
@opindex -k
|
|
@opindex --keep-files
|
|
Do not remove output files when errors are encountered.
|
|
|
|
@item -z
|
|
@itemx --elide-empty-files
|
|
@opindex -z
|
|
@opindex --elide-empty-files
|
|
Suppress the generation of zero-length output files. (In cases where
|
|
the section delimiters of the input file are supposed to mark the first
|
|
lines of each of the sections, the first output file will generally be a
|
|
zero-length file unless you use this option.) The output file sequence
|
|
numbers always run consecutively starting from 0, even when this option
|
|
is specified.
|
|
|
|
@item -s
|
|
@itemx -q
|
|
@itemx --silent
|
|
@itemx --quiet
|
|
@opindex -s
|
|
@opindex -q
|
|
@opindex --silent
|
|
@opindex --quiet
|
|
Do not print counts of output file sizes.
|
|
|
|
@end table
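
The way @samp{--prefix}, @samp{--suffix}, and @samp{--digits} combine into file names can be sketched with Python's @code{%} operator, which accepts the same integer conversions listed above. The function name @code{csplit_name} is hypothetical, and the sketch does not validate the suffix string.

```python
def csplit_name(number, prefix="xx", suffix=None, digits=2):
    """Return the name of output file NUMBER (counting from 0)."""
    if suffix is None:
        # Default suffix: a zero-padded decimal number DIGITS wide.
        suffix = "%%0%dd" % digits
    # When a suffix format is given, DIGITS is ignored.
    return prefix + suffix % number


print(csplit_name(0))
print(csplit_name(7, digits=3))
print(csplit_name(10, prefix="part", suffix="-%03x.txt"))
```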
|
|
|
|
|
|
@node Summarizing files
|
|
@chapter Summarizing files
|
|
|
|
@cindex summarizing files
|
|
|
|
These commands generate just a few numbers representing the entire
contents of files.
|
|
|
|
@menu
|
|
* wc invocation:: Print byte, word, and line counts.
|
|
* sum invocation:: Print checksum and block counts.
|
|
* cksum invocation:: Print CRC checksum and byte counts.
|
|
* md5sum invocation:: Print or check message-digests.
|
|
@end menu
|
|
|
|
|
|
@node wc invocation
|
|
@section @code{wc}: Print byte, word, and line counts
|
|
|
|
@pindex wc
|
|
@cindex byte count
|
|
@cindex word count
|
|
@cindex line count
|
|
|
|
@code{wc} counts the number of bytes, whitespace-separated words, and
|
|
newlines in each given @var{file}, or standard input if none are given
|
|
or for a @var{file} of @samp{-}. Synopsis:
|
|
|
|
@example
|
|
wc [@var{option}]@dots{} [@var{file}]@dots{}
|
|
@end example
|
|
|
|
@cindex total counts
|
|
@code{wc} prints one line of counts for each file, and if the file was
|
|
given as an argument, it prints the file name following the counts. If
|
|
more than one @var{file} is given, @code{wc} prints a final line
|
|
containing the cumulative counts, with the file name @file{total}. The
|
|
counts are printed in this order: newlines, words, bytes.
|
|
|
|
By default, @code{wc} prints all three counts. Options can specify
|
|
that only certain counts be printed. Options do not undo others
|
|
previously given, so
|
|
|
|
@example
|
|
wc --bytes --words
|
|
@end example
|
|
|
|
@noindent
|
|
prints both the byte counts and the word counts.
|
|
|
|
With the @code{--max-line-length} option, @code{wc} prints the length
|
|
of the longest line per file, and if there is more than one file it
|
|
prints the maximum (not the sum) of those lengths.
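
The counts described above can be modelled in a few lines of Python. This is a sketch with hypothetical function names. It measures line length in bytes, which is a simplification made for the example.

```python
def wc_counts(data):
    """Return (newlines, words, bytes) for DATA, in the order wc prints them."""
    return data.count(b"\n"), len(data.split()), len(data)


def longest_line(data):
    """Return the length of the longest line in DATA."""
    return max(len(line) for line in data.split(b"\n"))


def combined_longest(files):
    """For several files, report the maximum (not the sum) of the lengths."""
    return max(longest_line(data) for data in files)


print(wc_counts(b"one two\nthree\n"))
print(combined_longest([b"ab\n", b"abcdef\n"]))
```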
|
|
|
|
The program accepts the following options. Also see @ref{Common options}.
|
|
|
|
@table @samp
|
|
|
|
@item -c
|
|
@itemx --bytes
|
|
@itemx --chars
|
|
@opindex -c
|
|
@opindex --bytes
|
|
@opindex --chars
|
|
Print only the byte counts.
|
|
|
|
@item -w
|
|
@itemx --words
|
|
@opindex -w
|
|
@opindex --words
|
|
Print only the word counts.
|
|
|
|
@item -l
|
|
@itemx --lines
|
|
@opindex -l
|
|
@opindex --lines
|
|
Print only the newline counts.
|
|
|
|
@item -L
|
|
@itemx --max-line-length
|
|
@opindex -L
|
|
@opindex --max-line-length
|
|
Print only the maximum line lengths.
|
|
|
|
@end table
|
|
|
|
|
|
@node sum invocation
|
|
@section @code{sum}: Print checksum and block counts
|
|
|
|
@pindex sum
|
|
@cindex 16-bit checksum
|
|
@cindex checksum, 16-bit
|
|
|
|
@code{sum} computes a 16-bit checksum for each given @var{file}, or
|
|
standard input if none are given or for a @var{file} of @samp{-}. Synopsis:
|
|
|
|
@example
|
|
sum [@var{option}]@dots{} [@var{file}]@dots{}
|
|
@end example
|
|
|
|
@code{sum} prints the checksum for each @var{file} followed by the
|
|
number of blocks in the file (rounded up). If more than one @var{file}
|
|
is given, file names are also printed (by default). (With the
|
|
@samp{--sysv} option, the corresponding file names are printed when there is
|
|
at least one file argument.)
|
|
|
|
By default, GNU @code{sum} computes checksums using an algorithm
|
|
compatible with BSD @code{sum} and prints file sizes in units of
|
|
1024-byte blocks.
|
|
|
|
The program accepts the following options. Also see @ref{Common options}.
|
|
|
|
@table @samp
|
|
|
|
@item -r
|
|
@opindex -r
|
|
@cindex BSD @code{sum}
|
|
Use the default (BSD compatible) algorithm. This option is included for
|
|
compatibility with the System V @code{sum}. Unless @samp{-s} was also
|
|
given, it has no effect.
|
|
|
|
@item -s
|
|
@itemx --sysv
|
|
@opindex -s
|
|
@opindex --sysv
|
|
@cindex System V @code{sum}
|
|
Compute checksums using an algorithm compatible with System V
|
|
@code{sum}'s default, and print file sizes in units of 512-byte blocks.
|
|
|
|
@end table
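
For readers curious about the two checksums, the following Python sketch shows the traditional BSD and System V algorithms as they are commonly described. This manual does not spell the algorithms out, so treat the details as an assumption of the example. The function names are hypothetical. Each function returns the checksum together with the block count, rounded up.

```python
def bsd_sum(data):
    """Return (checksum, 1024-byte blocks) in the BSD style."""
    checksum = 0
    for byte in data:
        # Rotate the 16-bit checksum right by one bit, then add the byte.
        checksum = (checksum >> 1) + ((checksum & 1) << 15)
        checksum = (checksum + byte) & 0xFFFF
    return checksum, -(-len(data) // 1024)


def sysv_sum(data):
    """Return (checksum, 512-byte blocks) in the System V style."""
    total = sum(data) & 0xFFFFFFFF
    # Fold the 32-bit byte sum into 16 bits.
    folded = (total & 0xFFFF) + (total >> 16)
    checksum = (folded & 0xFFFF) + (folded >> 16)
    return checksum, -(-len(data) // 512)


print(bsd_sum(b"ab"))
print(sysv_sum(b"ab"))
```

Because the System V sum simply adds the bytes, it cannot detect reordered input, whereas the rotation in the BSD sum makes the result depend on byte order.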
|
|
|
|
@code{sum} is provided for compatibility; the @code{cksum} program (see
|
|
next section) is preferable in new applications.
|
|
|
|
|
|
@node cksum invocation
|
|
@section @code{cksum}: Print CRC checksum and byte counts
|
|
|
|
@pindex cksum
|
|
@cindex cyclic redundancy check
|
|
@cindex CRC checksum
|
|
|
|
@code{cksum} computes a cyclic redundancy check (CRC) checksum for each
|
|
given @var{file}, or standard input if none are given or for a
|
|
@var{file} of @samp{-}. Synopsis:
|
|
|
|
@example
|
|
cksum [@var{option}]@dots{} [@var{file}]@dots{}
|
|
@end example
|
|
|
|
@code{cksum} prints the CRC checksum for each file along with the number
|
|
of bytes in the file, and the filename unless no arguments were given.
|
|
|
|
@code{cksum} is typically used to ensure that files
|
|
transferred by unreliable means (e.g., netnews) have not been corrupted,
|
|
by comparing the @code{cksum} output for the received files with the
|
|
@code{cksum} output for the original files (typically given in the
|
|
distribution).
|
|
|
|
The CRC algorithm is specified by the @sc{POSIX.2} standard. It is not
|
|
compatible with the BSD or System V @code{sum} algorithms (see the
|
|
previous section); it is more robust.
|
|
|
|
The only options are @samp{--help} and @samp{--version}. @xref{Common
|
|
options}.
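
A bit-at-a-time Python sketch of the checksum follows. The details (the generator polynomial, feeding the input length after the data, and complementing the result) come from the usual description of the @sc{POSIX} algorithm rather than from this manual, so treat them as assumptions of the example. The function names are hypothetical, and a real implementation would use a lookup table for speed.

```python
POLYNOMIAL = 0x04C11DB7


def _feed(crc, byte):
    """Fold one byte into CRC, most significant bit first."""
    crc ^= byte << 24
    for _ in range(8):
        if crc & 0x80000000:
            crc = ((crc << 1) ^ POLYNOMIAL) & 0xFFFFFFFF
        else:
            crc = (crc << 1) & 0xFFFFFFFF
    return crc


def posix_cksum(data):
    """Return the CRC checksum of DATA as an unsigned 32-bit integer."""
    crc = 0
    for byte in data:
        crc = _feed(crc, byte)
    # The length is appended, least significant byte first.
    length = len(data)
    while length:
        crc = _feed(crc, length & 0xFF)
        length >>= 8
    return crc ^ 0xFFFFFFFF


print(posix_cksum(b""))
```

Including the length means that inputs differing only in the number of trailing zero bytes still get different checksums.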
|
|
|
|
|
|
@node md5sum invocation
|
|
@section @code{md5sum}: Print or check message-digests
|
|
|
|
@pindex md5sum
|
|
@cindex 128-bit checksum
|
|
@cindex checksum, 128-bit
|
|
@cindex fingerprint, 128-bit
|
|
@cindex message-digest, 128-bit
|
|
|
|
@code{md5sum} computes a 128-bit checksum (or @dfn{fingerprint} or
|
|
@dfn{message-digest}) for each specified @var{file}.
|
|
If a @var{file} is specified as @samp{-} or if no files are given,
|
|
@code{md5sum} computes the checksum for the standard input.
|
|
@code{md5sum} can also determine whether a file and checksum are
|
|
consistent. Synopses:
|
|
|
|
@example
|
|
md5sum [@var{option}]@dots{} [@var{file}]@dots{}
|
|
md5sum [@var{option}]@dots{} --check [@var{file}]
|
|
@end example
|
|
|
|
For each @var{file}, @code{md5sum} outputs the MD5 checksum, a flag
|
|
indicating a binary or text input file, and the filename.
|
|
If @var{file} is omitted or specified as @samp{-}, standard input is read.
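
The line format and the consistency check can be sketched with Python's standard @code{hashlib} module. The function names are hypothetical, and the exact layout assumed here (32 hexadecimal digits, a space, the flag character, then the file name) is an assumption based on the description in this section. The flag is @samp{*} for binary and a space for text.

```python
import hashlib


def md5sum_line(name, data, binary=False):
    """Return one output line: checksum, binary/text flag, file name."""
    flag = "*" if binary else " "
    return "%s %s%s" % (hashlib.md5(data).hexdigest(), flag, name)


def check_line(line, data):
    """Return (file name, passed) for one line of checksum input."""
    digest, name = line[:32], line[34:]
    return name, hashlib.md5(data).hexdigest() == digest


line = md5sum_line("empty", b"")
print(line)
print(check_line(line, b""))
print(check_line(line, b"changed"))
```

In this sketch the caller supplies the file contents, whereas @code{md5sum} itself reads each named file.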
|
|
|
|
The program accepts the following options. Also see @ref{Common options}.
|
|
|
|
@table @samp
|
|
|
|
@item -b
|
|
@itemx --binary
|
|
@opindex -b
|
|
@opindex --binary
|
|
@cindex binary input files
|
|
Treat all input files as binary. This option has no effect on Unix
|
|
systems, since they don't distinguish between binary and text files.
|
|
This option is useful on systems that have different internal and
|
|
external character representations. On MS-DOS and MS-Windows, this is
|
|
the default.
|
|
|
|
@item -c
|
|
@itemx --check
|
|
Read filenames and checksum information from the single @var{file}
|
|
(or from stdin if no @var{file} was specified) and report whether
|
|
each named file and the corresponding checksum data are consistent.
|
|
The input to this mode of @code{md5sum} is usually the output of
|
|
a prior, checksum-generating run of @samp{md5sum}.
|
|
Each valid line of input consists of an MD5 checksum, a binary/text
|
|
flag, and then a filename.
|
|
Binary files are marked with @samp{*}, text with @samp{ }.
|
|
For each such line, @code{md5sum} reads the named file and computes its
|
|
MD5 checksum. Then, if the computed message digest does not match the
|
|
one on the line with the filename, the file is noted as having
|
|
failed the test. Otherwise, the file passes the test.
|
|
By default, for each valid line, one line is written to standard
|
|
output indicating whether the named file passed the test.
|
|
After all checks have been performed, if there were any failures,
|
|
a warning is issued to standard error.
|
|
Use the @samp{--status} option to inhibit that output.
|
|
If any listed file cannot be opened or read, if any valid line has
|
|
an MD5 checksum inconsistent with the associated file, or if no valid
|
|
line is found, @code{md5sum} exits with nonzero status. Otherwise,
|
|
it exits successfully.
|
|
|
|
@item --status
|
|
@opindex --status
|
|
@cindex verifying MD5 checksums
|
|
This option is useful only when verifying checksums.
|
|
When verifying checksums, don't generate the default one-line-per-file
|
|
diagnostic and don't output the warning summarizing any failures.
|
|
Failures to open or read a file still evoke individual diagnostics to
|
|
standard error.
|
|
If all listed files are readable and are consistent with the associated
|
|
MD5 checksums, exit successfully. Otherwise exit with a status code
|
|
indicating there was a failure.
|
|
|
|
@item -t
|
|
@itemx --text
|
|
@opindex -t
|
|
@opindex --text
|
|
@cindex text input files
|
|
Treat all input files as text files. This is the reverse of
|
|
@samp{--binary}.
|
|
|
|
@item -w
|
|
@itemx --warn
|
|
@opindex -w
|
|
@opindex --warn
|
|
@cindex verifying MD5 checksums
|
|
When verifying checksums, warn about improperly formatted MD5 checksum lines.
|
|
This option is useful only if all but a few lines in the checked input
|
|
are valid.
|
|
|
|
@end table
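
The two modes described above can be sketched as follows. This is a minimal illustration, not a prescribed workflow: the directory, the file names and the list name @file{MD5SUMS} are assumptions made for the example, and @code{LC_ALL=C} is set only so that the messages are not translated.

```shell
# Hypothetical files, created only for this illustration.
workdir=$(mktemp -d)
printf 'first file\n'  > "$workdir/a.txt"
printf 'second file\n' > "$workdir/b.txt"

# Generate a checksum list, then verify it; one line is printed per file.
(cd "$workdir" && md5sum a.txt b.txt > MD5SUMS)
check_output=$(cd "$workdir" && LC_ALL=C md5sum --check MD5SUMS)
echo "$check_output"

# Alter one file.  With --status nothing is printed for the mismatch;
# only the exit status reports the failure.
printf 'tampered\n' > "$workdir/b.txt"
if (cd "$workdir" && LC_ALL=C md5sum --check --status MD5SUMS); then
  status_result=consistent
else
  status_result=mismatch
fi
echo "$status_result"
```

The @samp{--status} form is the one to prefer in scripts, where only the exit status matters.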
|
|
|
|
|
|
@node Operating on sorted files
|
|
@chapter Operating on sorted files
|
|
|
|
@cindex operating on sorted files
|
|
@cindex sorted files, operations on
|
|
|
|
These commands work with (or produce) sorted files.
|
|
|
|
@menu
|
|
* sort invocation:: Sort text files.
|
|
* uniq invocation:: Uniqify files.
|
|
* comm invocation:: Compare two sorted files line by line.
|
|
* ptx invocation:: Produce permuted indexes.
|
|
@end menu
|
|
|
|
|
|
@node sort invocation
|
|
@section @code{sort}: Sort text files
|
|
|
|
@pindex sort
|
|
@cindex sorting files
|
|
|
|
@code{sort} sorts, merges, or compares all the lines from the given
|
|
files, or standard input if none are given or for a @var{file} of
|
|
@samp{-}. By default, @code{sort} writes the results to standard
|
|
output. Synopsis:
|
|
|
|
@example
|
|
sort [@var{option}]@dots{} [@var{file}]@dots{}
|
|
@end example
|
|
|
|
@code{sort} has three modes of operation: sort (the default), merge,
|
|
and check for sortedness. The following options change the operation
|
|
mode:
|
|
|
|
@table @samp
|
|
|
|
@item -c
|
|
@opindex -c
|
|
@cindex checking for sortedness
|
|
Check whether the given files are already sorted: if they are not all
|
|
sorted, print an error message and exit with a status of 1.
|
|
Otherwise, exit successfully.
|
|
|
|
@item -m
|
|
@opindex -m
|
|
@cindex merging sorted files
|
|
Merge the given files by sorting them as a group. Each input file must
|
|
always be individually sorted. It always works to sort instead of
|
|
merge; merging is provided because it is faster, in the case where it
|
|
works.
|
|
|
|
@end table
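
A small sketch of the merge and check modes follows. The file names are hypothetical, and @code{LC_ALL=C} is an assumption made here so that the ordering is plain byte order regardless of the user's locale.

```shell
# Two hypothetical input files, each already sorted.
workdir=$(mktemp -d)
printf 'apple\ncherry\n' > "$workdir/one"
printf 'banana\ndate\n'  > "$workdir/two"

# Merge mode: interleave the already-sorted inputs.
merged=$(LC_ALL=C sort -m "$workdir/one" "$workdir/two")
echo "$merged"

# Check mode: a nonzero exit status signals input that is not sorted.
printf 'b\na\n' > "$workdir/unsorted"
if LC_ALL=C sort -c "$workdir/unsorted" 2>/dev/null; then
  check_result=sorted
else
  check_result=unsorted
fi
echo "$check_result"
```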
|
|
|
|
A pair of lines is compared as follows: if any key fields have been
|
|
specified, @code{sort} compares each pair of fields, in the order
|
|
specified on the command line, according to the associated ordering
|
|
options, until a difference is found or no fields are left.
|
|
|
|
If any of the global options @samp{Mbdfginr} are given but no key fields
|
|
are specified, @code{sort} compares the entire lines according to the
|
|
global options.
|
|
|
|
Finally, as a last resort when all keys compare equal (or if no
|
|
ordering options were specified at all), @code{sort} compares the lines
|
|
byte by byte in machine collating sequence. The last resort comparison
|
|
honors the @samp{-r} global option. The @samp{-s} (stable) option
|
|
disables this last-resort comparison so that lines in which all fields
|
|
compare equal are left in their original relative order. If no fields
|
|
or global options are specified, @samp{-s} has no effect.
|
|
|
|
GNU @code{sort} (as specified for all GNU utilities) has no limits on
|
|
input line length or restrictions on bytes allowed within lines. In
|
|
addition, if the final byte of an input file is not a newline, GNU
|
|
@code{sort} silently supplies one.
|
|
|
|
Upon any error, @code{sort} exits with a status of @samp{2}.
|
|
|
|
@vindex TMPDIR
|
|
If the environment variable @code{TMPDIR} is set, @code{sort} uses its
|
|
value as the directory for temporary files instead of @file{/tmp}. The
|
|
@samp{-T @var{tempdir}} option in turn overrides the environment
|
|
variable.
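
The precedence between @code{TMPDIR} and @samp{-T} can be sketched as below. The scratch directory is hypothetical, and an input this small is unlikely to need temporary files at all, so the example only shows how the two settings are spelled.

```shell
# Hypothetical scratch directory and input file.
scratch=$(mktemp -d)
printf '3\n1\n2\n' > "$scratch/input"

# Temporary files, if any are needed, go to the directory named by TMPDIR ...
via_env=$(TMPDIR="$scratch" sort "$scratch/input")

# ... unless -T names a directory, which overrides the environment variable.
via_option=$(TMPDIR=/nonexistent sort -T "$scratch" "$scratch/input")
echo "$via_option"
```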
|
|
|
|
The following options affect the ordering of output lines. They may be
|
|
specified globally or as part of a specific key field. If no key
|
|
fields are specified, global options apply to comparison of entire
|
|
lines; otherwise the global options are inherited by key fields that do
|
|
not specify any special options of their own.
|
|
|
|
@table @samp
|
|
|
|
@item -b
|
|
@opindex -b
|
|
@cindex blanks, ignoring leading
|
|
Ignore leading blanks when finding sort keys in each line.
|
|
|
|
@item -d
|
|
@opindex -d
|
|
@cindex phone directory order
|
|
@cindex telephone directory order
|
|
Sort in @dfn{phone directory} order: ignore all characters except
|
|
letters, digits and blanks when sorting.
|
|
|
|
@item -f
|
|
@opindex -f
|
|
@cindex case folding
|
|
Fold lowercase characters into the equivalent uppercase characters when
|
|
sorting so that, for example, @samp{b} and @samp{B} sort as equal.
|
|
|
|
@item -g
|
|
@opindex -g
|
|
@cindex general numeric sort
|
|
Sort numerically, but use strtod(3) to arrive at the numeric values.
|
|
This allows floating point numbers to be specified in scientific notation,
|
|
like @code{1.0e-34} and @code{10e100}. Use this option only if there
|
|
is no alternative; it is much slower than @samp{-n} and numbers with
|
|
too many significant digits will be compared as if they had been
|
|
truncated. In addition, numbers outside the range of representable
|
|
double precision floating point numbers are treated as if they were
|
|
zeroes; overflow and underflow are not reported.
|
|
|
|
@item -i
|
|
@opindex -i
|
|
@cindex unprintable characters, ignoring
|
|
Ignore characters outside the printable ASCII range 040-0176 octal
|
|
(inclusive) when sorting.
|
|
|
|
@item -M
|
|
@opindex -M
|
|
@cindex months, sorting by
|
|
An initial string, consisting of any amount of whitespace, followed
|
|
by three letters abbreviating a month name, is folded to UPPER case and
|
|
compared in the order @samp{JAN} < @samp{FEB} < @dots{} < @samp{DEC}.
|
|
Invalid names compare low to valid names.
|
|
|
|
@item -n
|
|
@opindex -n
|
|
@cindex numeric sort
|
|
Sort numerically: the number begins each line; specifically, it consists
|
|
of optional whitespace, an optional @samp{-} sign, and zero or more
|
|
digits, optionally followed by a decimal point and zero or more digits.
|
|
|
|
@code{sort -n} uses what might be considered an unconventional method
|
|
to compare strings representing floating point numbers. Rather than
|
|
first converting each string to the C @code{double} type and then
|
|
comparing those values, sort aligns the decimal points in the two
|
|
strings and compares the strings a character at a time. One benefit
|
|
of using this approach is its speed. In practice this is much more
|
|
efficient than performing the two corresponding string-to-double (or even
|
|
string-to-integer) conversions and then comparing doubles. In addition,
|
|
there is no corresponding loss of precision. Converting each string to
|
|
@code{double} before comparison would limit precision to about 16 digits
|
|
on most systems.
|
|
|
|
Neither a leading @samp{+} nor exponential notation is recognized.
|
|
To compare such strings numerically, use the @samp{-g} option.
|
|
|
|
@item -r
|
|
@opindex -r
|
|
@cindex reverse sorting
|
|
Reverse the result of comparison, so that lines with greater key values
|
|
appear earlier in the output instead of later.
|
|
|
|
@end table
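
The difference between @samp{-n} and @samp{-g} described above can be seen on a three-line input. This is only a sketch; the sample values are made up, and @code{LC_ALL=C} is assumed so that the decimal point and byte ordering are the default ones.

```shell
numbers=$(printf '1e3\n20\n3\n')

# -n does not recognize exponents: "1e3" is read as the number 1.
plain_numeric=$(printf '%s\n' "$numbers" | LC_ALL=C sort -n)

# -g converts each line with strtod, so "1e3" is 1000 and sorts last.
general_numeric=$(printf '%s\n' "$numbers" | LC_ALL=C sort -g)

# -r reverses the result of the -n comparison.
reversed=$(printf '%s\n' "$numbers" | LC_ALL=C sort -nr)

printf '%s\n--\n%s\n--\n%s\n' "$plain_numeric" "$general_numeric" "$reversed"
```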
|
|
|
|
Other options are:
|
|
|
|
@table @samp
|
|
|
|
@item -o @var{output-file}
|
|
@opindex -o
|
|
@cindex overwriting of input, allowed
|
|
Write output to @var{output-file} instead of standard output.
|
|
If @var{output-file} is one of the input files, @code{sort} copies
|
|
it to a temporary file before sorting and writing the output to
|
|
@var{output-file}.
|
|
|
|
@item -t @var{separator}
|
|
@opindex -t
|
|
@cindex field separator character
|
|
Use character @var{separator} as the field separator when finding the
|
|
sort keys in each line. By default, fields are separated by the empty
|
|
string between a non-whitespace character and a whitespace character.
|
|
That is, given the input line @w{@samp{ foo bar}}, @code{sort} breaks it
|
|
into fields @w{@samp{ foo}} and @w{@samp{ bar}}. The field separator is
|
|
not considered to be part of either the field preceding or the field
|
|
following.
|
|
|
|
@item -u
|
|
@opindex -u
|
|
@cindex uniqifying output
|
|
For the default case or the @samp{-m} option, only output the first
|
|
of a sequence of lines that compare equal. For the @samp{-c} option,
|
|
check that no pair of consecutive lines compares equal.
|
|
|
|
@item -k @var{pos1}[,@var{pos2}]
|
|
@opindex -k
|
|
@cindex sort field
|
|
The recommended, @sc{POSIX}, option for specifying a sort field. The field
|
|
consists of the line between @var{pos1} and @var{pos2} (or the end of
|
|
the line, if @var{pos2} is omitted), inclusive. Fields and character
|
|
positions are numbered starting with 1. See below.
|
|
|
|
@item -z
|
|
@opindex -z
|
|
@cindex sort zero-terminated lines
|
|
Treat the input as a set of lines, each terminated by a zero byte (the
@sc{ASCII} @sc{NUL} character) instead of an @sc{ASCII} @sc{LF} (Line Feed).
This option can be useful in conjunction with @samp{perl -0} or
@samp{find -print0} and @samp{xargs -0}, which do the same in order to
reliably handle arbitrary pathnames (even those which contain Line Feed
characters).
|
|
|
|
@item +@var{pos1}[-@var{pos2}]
|
|
The obsolete, traditional option for specifying a sort field. The field
|
|
consists of the line between @var{pos1} and up to but @emph{not including}
|
|
@var{pos2} (or the end of the line if @var{pos2} is omitted). Fields
|
|
and character positions are numbered starting with 0. See below.
|
|
|
|
@end table
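
A brief sketch of @samp{-u}, @samp{-o} and @samp{-z} used together with made-up data follows. The file name @file{fruit} is hypothetical, and @code{LC_ALL=C} is assumed for a predictable ordering.

```shell
# Hypothetical input file with duplicate lines.
workdir=$(mktemp -d)
printf 'pear\napple\npear\nfig\napple\n' > "$workdir/fruit"

# Sort in place, keeping one line of each run of equal lines:
# the output file is also the input file.
LC_ALL=C sort -u -o "$workdir/fruit" "$workdir/fruit"
cat "$workdir/fruit"

# Zero-terminated records; tr is used here only to make the result printable.
sorted_z=$(printf 'b\0a\0' | LC_ALL=C sort -z | tr '\0' '\n')
echo "$sorted_z"
```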
|
|
|
|
In addition, when GNU @code{sort} is invoked with exactly one argument,
|
|
options @samp{--help} and @samp{--version} are recognized. @xref{Common
|
|
options}.
|
|
|
|
Historical (BSD and System V) implementations of @code{sort} have
|
|
differed in their interpretation of some options, particularly
|
|
@samp{-b}, @samp{-f}, and @samp{-n}. GNU sort follows the @sc{POSIX}
|
|
behavior, which is usually (but not always!) like the System V behavior.
|
|
According to @sc{POSIX}, @samp{-n} no longer implies @samp{-b}. For
|
|
consistency, @samp{-M} has been changed in the same way. This may
|
|
affect the meaning of character positions in field specifications in
|
|
obscure cases. The only fix is to add an explicit @samp{-b}.
|
|
|
|
A position in a sort field specified with the @samp{-k} or @samp{+}
|
|
option has the form @samp{@var{f}.@var{c}}, where @var{f} is the number
|
|
of the field to use and @var{c} is the number of the first character
|
|
from the beginning of the field (for @samp{+@var{pos}}) or from the end
|
|
of the previous field (for @samp{-@var{pos}}). If the @samp{.@var{c}}
|
|
is omitted, it is taken to be the first character in the field. If the
|
|
@samp{-b} option was specified, the @samp{.@var{c}} part of a field
|
|
specification is counted from the first nonblank character of the field
|
|
(for @samp{+@var{pos}}) or from the first nonblank character following
|
|
the previous field (for @samp{-@var{pos}}).
|
|
|
|
A sort key option may also have any of the option letters @samp{Mbdfginr}
|
|
appended to it, in which case the global ordering options are not used
|
|
for that particular field. The @samp{-b} option may be independently
|
|
attached to either or both of the @samp{+@var{pos}} and
|
|
@samp{-@var{pos}} parts of a field specification, and if it is inherited
|
|
from the global options it will be attached to both.
|
|
Keys may span multiple fields.
|
|
|
|
Here are some examples to illustrate various combinations of options.
|
|
In them, the @sc{POSIX} @samp{-k} option is used to specify sort keys rather
|
|
than the obsolete @samp{+@var{pos1}-@var{pos2}} syntax.
|
|
|
|
@itemize @bullet
|
|
|
|
@item
|
|
Sort in descending (reverse) numeric order.
|
|
|
|
@example
|
|
sort -nr
|
|
@end example
|
|
|
|
@item
Sort alphabetically, omitting the first and second fields.
|
|
This uses a single key composed of the characters beginning
|
|
at the start of field three and extending to the end of each line.
|
|
|
|
@example
|
|
sort -k3
|
|
@end example
|
|
|
|
@item
|
|
Sort numerically on the second field and resolve ties by sorting
|
|
alphabetically on the third and fourth characters of field five.
|
|
Use @samp{:} as the field delimiter.
|
|
|
|
@example
|
|
sort -t : -k 2,2n -k 5.3,5.4
|
|
@end example
|
|
|
|
Note that if you had written @samp{-k 2} instead of @samp{-k 2,2},
@samp{sort} would have used all characters beginning in the second field
|
|
and extending to the end of the line as the primary @emph{numeric}
|
|
key. For the large majority of applications, treating keys spanning
|
|
more than one field as numeric will not do what you expect.
|
|
|
|
Also note that the @samp{n} modifier was applied to the field-end
|
|
specifier for the first key. It would have been equivalent to
|
|
specify @samp{-k 2n,2} or @samp{-k 2n,2n}. All modifiers except
|
|
@samp{b} apply to the associated @emph{field}, regardless of whether
|
|
the modifier character is attached to the field-start and/or the
|
|
field-end part of the key specifier.
|
|
|
|
@item
|
|
Sort the password file on the fifth field and ignore any
|
|
leading white space. Sort lines with equal values in field five
|
|
on the numeric user ID in field three.
|
|
|
|
@example
|
|
sort -t : -k 5b,5 -k 3,3n /etc/passwd
|
|
@end example
|
|
|
|
An alternative is to use the global numeric modifier @samp{-n}.
|
|
|
|
@example
|
|
sort -t : -n -k 5b,5 -k 3,3 /etc/passwd
|
|
@end example
|
|
|
|
@item
|
|
Generate a tags file in case-insensitive sorted order.
|
|
@example
|
|
find src -type f -print0 | sort -t / -z -f | xargs -0 etags --append
|
|
@end example
|
|
|
|
The use of @samp{-print0}, @samp{-z}, and @samp{-0} in this case means
|
|
that pathnames that contain Line Feed characters will not get broken up
|
|
by the sort operation.
|
|
|
|
Finally, to ignore both leading and trailing white space in the password
file example above, you could have applied the @samp{b} modifier to the
field-end specifier for the first key:
|
|
|
|
@example
|
|
sort -t : -n -k 5b,5b -k 3,3 /etc/passwd
|
|
@end example
|
|
|
|
or you could have used the global @samp{-b} modifier instead of @samp{-n},
with an explicit @samp{n} on the second key specifier:
|
|
|
|
@example
|
|
sort -t : -b -k 5,5 -k 3,3n /etc/passwd
|
|
@end example
|
|
|
|
@c This example is a bit contrived and needs more explanation.
|
|
@c @item
|
|
@c Sort records separated by an arbitrary string by using a pipe to convert
|
|
@c each record delimiter string to @samp{\0}, then using sort's -z option,
|
|
@c and converting each @samp{\0} back to the original record delimiter.
|
|
@c
|
|
@c @example
|
|
@c printf 'c\n\nb\n\na\n'|perl -0pe 's/\n\n/\n\0/g'|sort -z|perl -0pe 's/\0/\n/g'
|
|
@c @end example
|
|
|
|
@end itemize
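
To make the @samp{-k 2,2n -k 5.3,5.4} example above concrete, here is a sketch run on three made-up records. The names and field contents are hypothetical, and @code{LC_ALL=C} is assumed so that the tie-breaking key is compared in byte order.

```shell
# Colon-separated records: field 2 is numeric, field 5 holds the tie-breaker.
records=$(printf 'carol:10:x:x:zzbc\nalice:2:x:x:zzab\nbob:10:x:x:zzaa\n')

# Primary key: field 2, numeric.  Secondary key: characters 3-4 of field 5.
sorted=$(printf '%s\n' "$records" | LC_ALL=C sort -t : -k 2,2n -k 5.3,5.4)
echo "$sorted"

# Keep only the first field to show the resulting order.
order=$(printf '%s\n' "$sorted" | cut -d : -f 1)
echo "$order"
```

Here @samp{alice} comes first because 2 is less than 10; @samp{bob} precedes @samp{carol} because @samp{aa} sorts before @samp{bc} in the secondary key.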
|
|
|
|
|
|
@node uniq invocation
|
|
@section @code{uniq}: Uniqify files
|
|
|
|
@pindex uniq
|
|
@cindex uniqify files
|
|
|
|
@code{uniq} writes the unique lines in the given @var{input}, or
|
|
standard input if nothing is given or for an @var{input} name of
|
|
@samp{-}. Synopsis:
|
|
|
|
@example
|
|
uniq [@var{option}]@dots{} [@var{input} [@var{output}]]
|
|
@end example
|
|
|
|
By default, @code{uniq} prints the unique lines in a sorted file, i.e.,
|
|
discards all but one of identical successive lines. Optionally, it can
|
|
instead show only lines that appear exactly once, or lines that appear
|
|
more than once.
|
|
|
|
The input must be sorted. If your input is not sorted, perhaps you want
|
|
to use @code{sort -u}.
|
|
|
|
If no @var{output} file is specified, @code{uniq} writes to standard
|
|
output.
|
|
|
|
The program accepts the following options. Also see @ref{Common options}.
|
|
|
|
@table @samp
|
|
|
|
@item -@var{n}
|
|
@itemx -f @var{n}
|
|
@itemx --skip-fields=@var{n}
|
|
@opindex -@var{n}
|
|
@opindex -f
|
|
@opindex --skip-fields
|
|
Skip @var{n} fields on each line before checking for uniqueness. Fields
|
|
are sequences of non-space non-tab characters that are separated from
|
|
each other by at least one space or tab.
|
|
|
|
@item +@var{n}
|
|
@itemx -s @var{n}
|
|
@itemx --skip-chars=@var{n}
|
|
@opindex +@var{n}
|
|
@opindex -s
|
|
@opindex --skip-chars
|
|
Skip @var{n} characters before checking for uniqueness. If you use both
|
|
the field and character skipping options, fields are skipped over first.
|
|
|
|
@item -c
|
|
@itemx --count
|
|
@opindex -c
|
|
@opindex --count
|
|
Print the number of times each line occurred along with the line.
|
|
|
|
@item -i
|
|
@itemx --ignore-case
|
|
@opindex -i
|
|
@opindex --ignore-case
|
|
Ignore differences in case when comparing lines.
|
|
|
|
@item -d
|
|
@itemx --repeated
|
|
@opindex -d
|
|
@opindex --repeated
|
|
@cindex duplicate lines, outputting
|
|
Print only duplicate lines.
|
|
|
|
@item -u
|
|
@itemx --unique
|
|
@opindex -u
|
|
@opindex --unique
|
|
@cindex unique lines, outputting
|
|
Print only unique lines.
|
|
|
|
@item -w @var{n}
|
|
@itemx --check-chars=@var{n}
|
|
@opindex -w
|
|
@opindex --check-chars
|
|
Compare @var{n} characters on each line (after skipping any specified
|
|
fields and characters). By default the entire rest of the lines are
|
|
compared.
|
|
|
|
@end table
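
The options above can be tried on a small, already sorted input. This is only a sketch with made-up data; the @code{sed} call is there just to strip the leading padding that @samp{-c} puts before each count.

```shell
sorted_input=$(printf 'a\na\nb\nc\nc\nc\n')

# Default: one copy of each run of identical lines.
all_once=$(printf '%s\n' "$sorted_input" | uniq)

# -d: only lines that are repeated; -u: only lines that are not.
repeated=$(printf '%s\n' "$sorted_input" | uniq -d)
singletons=$(printf '%s\n' "$sorted_input" | uniq -u)

# -c: prefix each line with its number of occurrences.
counts=$(printf '%s\n' "$sorted_input" | uniq -c | sed 's/^ *//')

# -f 1: skip the first field, so only the rest of the line is compared.
by_second_field=$(printf '1 x\n2 x\n3 y\n' | uniq -f 1)

printf '%s\n' "$counts"
```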
|
|
|
|
|
|
@node comm invocation
|
|
@section @code{comm}: Compare two sorted files line by line
|
|
|
|
@pindex comm
|
|
@cindex line-by-line comparison
|
|
@cindex comparing sorted files
|
|
|
|
@code{comm} writes to standard output lines that are common, and lines
|
|
that are unique, to two input files; a file name of @samp{-} means
|
|
standard input. Synopsis:
|
|
|
|
@example
|
|
comm [@var{option}]@dots{} @var{file1} @var{file2}
|
|
@end example
|
|
|
|
The input files must be sorted before @code{comm} can be used.
|
|
|
|
@cindex differing lines
|
|
@cindex common lines
|
|
With no options, @code{comm} produces three-column output. Column one
|
|
contains lines unique to @var{file1}, column two contains lines unique
|
|
to @var{file2}, and column three contains lines common to both files.
|
|
Columns are separated by @key{TAB}.
|
|
@c FIXME: when there's an option to supply an alternative separator
|
|
@c string, append `by default' to the above sentence.
|
|
|
|
@opindex -1
|
|
@opindex -2
|
|
@opindex -3
|
|
The options @samp{-1}, @samp{-2}, and @samp{-3} suppress printing of
|
|
the corresponding columns. Also see @ref{Common options}.
|
|
|
|
Unlike some other comparison utilities, @code{comm} has an exit
|
|
status that does not depend on the result of the comparison.
|
|
Upon normal completion @code{comm} produces an exit code of zero.
|
|
If there is an error it exits with nonzero status.
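
Suppressing columns is the usual way to extract one of the three sets of lines. The sketch below uses two made-up sorted files; @code{LC_ALL=C} is assumed so that the files count as sorted regardless of the user's locale.

```shell
# Two hypothetical sorted input files.
workdir=$(mktemp -d)
printf 'a\nb\nc\n' > "$workdir/file1"
printf 'b\nc\nd\n' > "$workdir/file2"

# Suppress columns 1 and 2: only lines common to both files remain.
common=$(LC_ALL=C comm -12 "$workdir/file1" "$workdir/file2")

# Suppress columns 2 and 3: lines found only in file1.
only_first=$(LC_ALL=C comm -23 "$workdir/file1" "$workdir/file2")

# Suppress columns 1 and 3: lines found only in file2.
only_second=$(LC_ALL=C comm -13 "$workdir/file1" "$workdir/file2")

printf 'common: %s\n' $common
```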
|
|
|
|
|
|
@node ptx invocation
|
|
@section @code{ptx}: Produce permuted indexes
|
|
|
|
@pindex ptx
|
|
|
|
@code{ptx} reads a text file and essentially produces a permuted index, with
|
|
each keyword in its context. The calling sketch is either one of:
|
|
|
|
@example
|
|
ptx [@var{option} @dots{}] [@var{file} @dots{}]
|
|
ptx -G [@var{option} @dots{}] [@var{input} [@var{output}]]
|
|
@end example
|
|
|
|
The @samp{-G} (or its equivalent: @samp{--traditional}) option disables
all GNU extensions and reverts to traditional mode, thus introducing some
limitations and changing several of the program's default option values.
|
|
When @samp{-G} is not specified, GNU extensions are always enabled. GNU
|
|
extensions to @code{ptx} are documented wherever appropriate in this
|
|
document. For the full list, see @ref{Compatibility in ptx}.
|
|
|
|
Individual options are explained in the following sections.
|
|
|
|
When GNU extensions are enabled, there may be zero, one or several
@var{file}s after the options. If there is no @var{file}, the program
reads the standard input. If there are one or several @var{file}s, they
give the names of input files, which are all read in turn, as if all the
|
|
input files were concatenated. However, there is a full contextual
|
|
break between each file and, when automatic referencing is requested,
|
|
file names and line numbers refer to individual text input files. In
|
|
all cases, the program produces the permuted index onto the standard
|
|
output.
|
|
|
|
When GNU extensions are @emph{not} enabled, that is, when the program
|
|
operates in traditional mode, there may be zero, one or two parameters
|
|
besides the options. If there are no parameters, the program reads the
|
|
standard input and produces the permuted index onto the standard output.
|
|
If there is only one parameter, it names the text @var{input} to be read
|
|
instead of the standard input. If two parameters are given, they give
|
|
respectively the name of the @var{input} file to read and the name of
|
|
the @var{output} file to produce. @emph{Be very careful} to note that,
|
|
in this case, the contents of the file given by the second parameter are
|
|
destroyed. This behaviour is dictated only by System V @code{ptx}
|
|
compatibility, because GNU Standards discourage output parameters not
|
|
introduced by an option.
|
|
|
|
Note that for @emph{any} file named as the value of an option or as an
|
|
input text file, a single dash @kbd{-} may be used, in which case
|
|
standard input is assumed. However, it would not make sense to use this
|
|
convention more than once per program invocation.
|
|
|
|
@menu
|
|
* General options in ptx:: Options which affect general program behaviour.
|
|
* Charset selection in ptx:: Underlying character set considerations.
|
|
* Input processing in ptx:: Input fields, contexts, and keyword selection.
|
|
* Output formatting in ptx:: Types of output format, and sizing the fields.
|
|
* Compatibility in ptx::
|
|
@end menu
|
|
|
|
|
|
@node General options in ptx
|
|
@subsection General options
|
|
|
|
@table @code
|
|
|
|
@item -C
|
|
@itemx --copyright
|
|
Print a short note about the copyright and copying conditions, then
|
|
exit without further processing.
|
|
|
|
@item -G
|
|
@itemx --traditional
|
|
As already explained, this option disables all GNU extensions to
@code{ptx} and switches to traditional mode.
|
|
|
|
@item --help
|
|
Print a short help message on standard output, then exit without further
|
|
processing.
|
|
|
|
@item --version
|
|
Print the program version on standard output, then exit without further
|
|
processing.
|
|
|
|
@end table
|
|
|
|
|
|
@node Charset selection in ptx
|
|
@subsection Charset selection
|
|
|
|
As it is set up now, the program assumes that the input file is coded
using 8-bit ISO 8859-1 code, also known as the Latin-1 character set,
@emph{unless} it is compiled for MS-DOS, in which case it uses the
character set of the IBM-PC. (GNU @code{ptx} is not known to work on
smaller MS-DOS machines anymore.) Compared to 7-bit ASCII, the set of
characters which are letters is different; this alters the
behaviour of regular expression matching. Thus, the default regular
expression for a keyword allows foreign or diacriticized letters.
|
|
Keyword sorting, however, is still crude; it obeys the underlying
|
|
character set ordering quite blindly.
|
|
|
|
@table @code
|
|
|
|
@item -f
|
|
@itemx --ignore-case
|
|
Fold lower case letters to upper case for sorting.
|
|
|
|
@end table
|
|
|
|
|
|
@node Input processing in ptx
|
|
@subsection Word selection and input processing
|
|
|
|
@table @code
|
|
|
|
@item -b @var{file}
|
|
@itemx --break-file=@var{file}
|
|
|
|
This option provides an alternative to option @code{-W} for describing
which characters make up words. It introduces the name of a
file which contains a list of characters which can@emph{not} be part of
a word; this file is called the @dfn{Break file}. Any character which
|
|
is not part of the Break file is a word constituent. If both options
|
|
@code{-b} and @code{-W} are specified, then @code{-W} has precedence and
|
|
@code{-b} is ignored.
|
|
|
|
When GNU extensions are enabled, the only way to avoid newline as a
|
|
break character is to write all the break characters in the file with no
|
|
newline at all, not even at the end of the file. When GNU extensions
|
|
are disabled, spaces, tabs and newlines are always considered as break
|
|
characters even if not included in the Break file.
|
|
|
|
@item -i @var{file}
|
|
@itemx --ignore-file=@var{file}
|
|
|
|
The file associated with this option contains a list of words which will
|
|
never be taken as keywords in concordance output. It is called the
|
|
@dfn{Ignore file}. The file contains exactly one word in each line; the
|
|
end of line separation of words is not subject to the value of the
|
|
@code{-S} option.
|
|
|
|
There is a default Ignore file used by @code{ptx} when this option is
|
|
not specified, usually found in @file{/usr/local/lib/eign} if this has
|
|
not been changed at installation time. If you want to deactivate the
|
|
default Ignore file, specify @code{/dev/null} instead.
|
|
|
|
@item -o @var{file}
|
|
@itemx --only-file=@var{file}
|
|
|
|
The file associated with this option contains a list of words which will
|
|
be retained in concordance output; any word not mentioned in this file
|
|
is ignored. The file is called the @dfn{Only file}. The file contains
|
|
exactly one word in each line; the end of line separation of words is
|
|
not subject to the value of the @code{-S} option.
|
|
|
|
There is no default for the Only file. When both an
Only file and an Ignore file are specified, a word is considered a keyword
only if it is listed in the Only file and not in the Ignore file.
|
|
|
|
@item -r
|
|
@itemx --references
|
|
|
|
On each input line, the leading sequence of non-whitespace characters will be
|
|
taken to be a reference that has the purpose of identifying this input
|
|
line on the produced permuted index. For more information about reference
production, see @ref{Output formatting in ptx}.
|
|
Using this option changes the default value for option @code{-S}.
|
|
|
|
Using this option, the program does not try very hard to remove
|
|
references from contexts in output, but it succeeds in doing so
|
|
@emph{when} the context ends exactly at the newline. If option
|
|
@code{-r} is used with @code{-S} default value, or when GNU extensions
|
|
are disabled, this condition is always met and references are completely
|
|
excluded from the output contexts.
|
|
|
|
@item -S @var{regexp}
|
|
@itemx --sentence-regexp=@var{regexp}
|
|
|
|
This option selects which regular expression will describe the end of a
line or the end of a sentence. In fact, the effect of this regular
expression is not the only distinction between ends of lines and ends of
sentences, and input line boundaries have no special significance
outside this option. By default, when GNU extensions are enabled and if
the @code{-r} option is not used, ends of sentences are used. In this
case, the precise @var{regexp} is imported from GNU Emacs:
|
|
|
|
@example
|
|
[.?!][]\"')@}]*\\($\\|\t\\| \\)[ \t\n]*
|
|
@end example
|
|
|
|
Whenever GNU extensions are disabled or if the @code{-r} option is used, ends
of lines are used; in this case, the default @var{regexp} is just:
|
|
|
|
@example
|
|
\n
|
|
@end example
|
|
|
|
Using an empty @var{regexp} is equivalent to completely disabling end of line or end
|
|
of sentence recognition. In this case, the whole file is considered to
|
|
be a single big line or sentence. The user might want to disallow all
|
|
truncation flag generation as well, through option @code{-F ""}.
|
|
@xref{Regexps, , Syntax of Regular Expressions, emacs, The GNU Emacs
|
|
Manual}.
|
|
|
|
When the keywords happen to be near the beginning of the input line or
|
|
sentence, this often creates an unused area at the beginning of the
|
|
output context line; when the keywords happen to be near the end of the
|
|
input line or sentence, this often creates an unused area at the end of
|
|
the output context line. The program tries to fill those unused areas
|
|
by wrapping around context in them; the tail of the input line or
|
|
sentence is used to fill the unused area on the left of the output line;
|
|
the head of the input line or sentence is used to fill the unused area
|
|
on the right of the output line.
|
|
|
|
As a matter of convenience to the user, many usual backslashed escape
|
|
sequences, as found in the C language, are recognized and converted to
|
|
the corresponding characters by @code{ptx} itself.
|
|
|
|
@item -W @var{regexp}
|
|
@itemx --word-regexp=@var{regexp}
|
|
|
|
This option selects which regular expression will describe each keyword.
|
|
By default, if GNU extensions are enabled, a word is a sequence of
|
|
letters; the @var{regexp} used is @code{\w+}. When GNU extensions are
|
|
disabled, a word is by default anything which ends with a space, a tab
|
|
or a newline; the @var{regexp} used is @code{[^ \t\n]+}.
|
|
|
|
An empty @var{regexp} is equivalent to not using this option, letting the
default apply. @xref{Regexps, , Syntax of Regular Expressions, emacs,
|
|
The GNU Emacs Manual}.
|
|
|
|
As a matter of convenience to the user, many usual backslashed escape
|
|
sequences, as found in the C language, are recognized and converted to
|
|
the corresponding characters by @code{ptx} itself.
|
|
|
|
@end table
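
As a hedged illustration of keyword selection, the sketch below restricts the index to a single word by means of an Only file. The file name and the sample sentence are made up, and the exact layout of the output line depends on the formatting options, so no expected output is shown.

```shell
# Hypothetical Only file: exactly one word per line.
workdir=$(mktemp -d)
printf 'beta\n' > "$workdir/only"

# Only occurrences of words listed in the Only file become keywords,
# so this input yields a single index line, for the keyword "beta".
index=$(printf 'alpha beta gamma\n' | ptx -o "$workdir/only")
echo "$index"
```

An Ignore file given with @code{-i} works the other way round, removing the listed words from the set of keywords.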
|
|
|
|
|
|
@node Output formatting in ptx
|
|
@subsection Output formatting
|
|
|
|
Output format is mainly controlled by @code{-O} and @code{-T} options,
|
|
described in the table below. When neither @code{-O} nor @code{-T} is
|
|
selected, and if GNU extensions are enabled, the program chooses an
|
|
output format suited for a dumb terminal. Each keyword occurrence is
|
|
output to the center of one line, surrounded by its left and right
|
|
contexts. Each field is properly justified, so the concordance output
|
|
could readily be observed. As a special feature, if automatic
|
|
references are selected by option @code{-A} and are output before the
|
|
left context, that is, if option @code{-R} is @emph{not} selected, then
|
|
a colon is added after the reference; this nicely interfaces with GNU
|
|
Emacs @code{next-error} processing. In this default output format, each
|
|
white space character, like newline and tab, is merely changed to
|
|
exactly one space, with no special attempt to compress consecutive
|
|
spaces. This might change in the future. Except for those white space
|
|
characters, every other character of the underlying set of 256
|
|
characters is transmitted verbatim.
|
|
|
|
Output format is further controlled by the following options.
|
|
|
|
@table @code
|
|
|
|
@item -g @var{number}
|
|
@itemx --gap-size=@var{number}
|
|
|
|
Select the size of the minimum white gap between the fields on the output
|
|
line.
|
|
|
|
@item -w @var{number}
|
|
@itemx --width=@var{number}
|
|
|
|
Select the output maximum width of each final line. If references are
|
|
used, they are included or excluded from the output maximum width
|
|
depending on the value of option @code{-R}. If this option is not
|
|
selected, that is, when references are output before the left context,
|
|
the output maximum width takes into account the maximum length of all
|
|
references. If this option is selected, that is, when references are
|
|
output after the right context, the output maximum width does not take
|
|
into account the space taken by references, nor the gap that precedes
|
|
them.
|
|
|
|
@item -A
|
|
@itemx --auto-reference
|
|
|
|
Select automatic references. Each input line will have an automatic
|
|
reference made up of the file name and the line ordinal, with a single
|
|
colon between them. However, the file name will be empty when standard
|
|
input is being read. If both @code{-A} and @code{-r} are selected, then
|
|
the input reference is still read and skipped, but the automatic
|
|
reference is used at output time, overriding the input reference.
|
|
|
|
@item -R
|
|
@itemx --right-side-refs
|
|
|
|
In default output format, when option @code{-R} is not used, any
references produced by the effect of options @code{-r} or @code{-A} are
given at the beginning of each output line, before the left context. In
default output format, when option @code{-R} is specified, references
are rather given to the far right of output lines, after the right
|
|
context. For any other output format, option @code{-R} is almost
|
|
ignored, except for the fact that the width of references is @emph{not}
|
|
taken into account in total output width given by @code{-w} whenever
|
|
@code{-R} is selected.
|
|
|
|
This option is automatically selected whenever GNU extensions are
|
|
disabled.
|
|
|
|
@item -F @var{string}
|
|
@itemx --flag-truncation=@var{string}
|
|
|
|
This option will request that any truncation in the output be reported
|
|
using the string @var{string}. Most output fields theoretically extend
|
|
towards the beginning or the end of the current line, or current
|
|
sentence, as selected with option @code{-S}. But there is a maximum
|
|
allowed output line width, changeable through option @code{-w}, which is
|
|
further divided into space for various output fields. When a field has
|
|
to be truncated because it cannot extend to the beginning or the end of
the current line and still fit in its allotted space, a truncation
occurs. By default,
|
|
the string used is a single slash, as in @code{-F /}.
|
|
|
|
@var{string} may have more than one character, as in @code{-F ...}.
|
|
Also, in the particular case @var{string} is empty (@code{-F ""}),
|
|
truncation flagging is disabled, and no truncation marks are appended in
|
|
this case.
|
|
|
|
As a matter of convenience to the user, many usual backslashed escape
|
|
sequences, as found in the C language, are recognized and converted to
|
|
the corresponding characters by @code{ptx} itself.
|
|
|
|
@item -M @var{string}
|
|
@itemx --macro-name=@var{string}
|
|
|
|
Select another @var{string} to be used instead of @samp{xx}, while
|
|
generating output suitable for @code{nroff}, @code{troff} or @TeX{}.
|
|
|
|
@item -O
|
|
@itemx --format=roff
|
|
|
|
Choose an output format suitable for @code{nroff} or @code{troff}
|
|
processing. Each output line will look like:
|
|
|
|
@example
|
|
.xx "@var{tail}" "@var{before}" "@var{keyword_and_after}" "@var{head}" "@var{ref}"
|
|
@end example
|
|
|
|
so it will be possible to write an @samp{.xx} roff macro to take care of
|
|
the output typesetting. This is the default output format when GNU
|
|
extensions are disabled. Option @samp{-M} might be used to change
|
|
@samp{xx} to another macro name.
|
|
|
|
In this output format, each non-graphical character, like newline and
|
|
tab, is merely changed to exactly one space, with no special attempt to
|
|
compress consecutive spaces. Each quote character: @kbd{"} is doubled
|
|
so it will be correctly processed by @code{nroff} or @code{troff}.
|
|
|
|
@item -T
|
|
@itemx --format=tex
|
|
|
|
Choose an output format suitable for @TeX{} processing. Each output
|
|
line will look like:
|
|
|
|
@example
|
|
\xx @{@var{tail}@}@{@var{before}@}@{@var{keyword}@}@{@var{after}@}@{@var{head}@}@{@var{ref}@}
|
|
@end example
|
|
|
|
@noindent
|
|
so it will be possible to write a @code{\xx} definition to take
|
|
care of the output typesetting. Note that when references are not being
|
|
produced, that is, neither option @code{-A} nor option @code{-r} is
|
|
selected, the last parameter of each @code{\xx} call is inhibited.
|
|
Option @samp{-M} might be used to change @samp{xx} to another macro
|
|
name.
|
|
|
|
In this output format, some special characters, like @kbd{$}, @kbd{%},
|
|
@kbd{&}, @kbd{#} and @kbd{_} are automatically protected with a
|
|
backslash. Curly brackets @kbd{@{}, @kbd{@}} are also protected with a
|
|
backslash, but also enclosed in a pair of dollar signs to force
|
|
mathematical mode. The backslash itself produces the sequence
|
|
@code{\backslash@{@}}. Circumflex and tilde diacritics produce the
|
|
sequence @code{^\@{ @}} and @code{~\@{ @}} respectively. Other
|
|
diacriticized characters of the underlying character set produce an
|
|
appropriate @TeX{} sequence as far as possible. The other non-graphical
|
|
characters, like newline and tab, and all other characters which are
|
|
not part of ASCII, are merely changed to exactly one space, with no
|
|
special attempt to compress consecutive spaces. Let me know how to
|
|
improve this special character processing for @TeX{}.
|
|
|
|
@end table
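As a rough illustration of the three output formats, the following
sketch runs @code{ptx} on one line of standard input. The sample text
and the macro name @samp{KW} are invented for the illustration. The
exact spacing of the default format depends on the @code{-w} and
@code{-g} settings, so no output is reproduced here.

```shell
# Sample input; any short text would do.
text='the quick brown fox'

# Default (dumb terminal) format, with automatic references (-A).
dumb=$(printf '%s\n' "$text" | ptx -A -w 60)

# roff format: each output line is a call to the .xx macro.
roff=$(printf '%s\n' "$text" | ptx -O)

# TeX format, with the macro renamed from xx to KW by -M.
tex=$(printf '%s\n' "$text" | ptx -T -M KW)

printf '%s\n\n%s\n\n%s\n' "$dumb" "$roff" "$tex"
```

Comparing the three results shows the same concordance entries, once
laid out for direct reading and twice wrapped in macro calls.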
|
|
|
|
|
|
@node Compatibility in ptx
|
|
@subsection The GNU extensions to @code{ptx}
|
|
|
|
This version of @code{ptx} contains a few features which do not exist in
|
|
System V @code{ptx}. These extra features are suppressed by using the
|
|
@samp{-G} command line option, unless overridden by other command line
|
|
options. Some GNU extensions cannot be recovered by overriding, so the
|
|
simple rule is to avoid @samp{-G} if you care about GNU extensions.
|
|
Here are the differences between this program and System V @code{ptx}.
|
|
|
|
@itemize @bullet
|
|
|
|
@item
|
|
This program can read many input files at once; it always writes the
resulting concordance on standard output. On the other hand, System V
@code{ptx} reads only one file and produces the result on standard output
or, if a second @var{file} parameter is given on the command line, to that
|
|
@var{file}.
|
|
|
|
Having output parameters not introduced by options is a quite dangerous
|
|
practice which GNU avoids as far as possible. So, for using @code{ptx}
|
|
portably between GNU and System V, you should pay attention to always
|
|
use it with a single input file, and always expect the result on
|
|
standard output. You might also want to automatically configure in a
|
|
@samp{-G} option to @code{ptx} calls in products using @code{ptx}, if
|
|
the configurator finds that the installed @code{ptx} accepts @samp{-G}.
|
|
|
|
@item
|
|
The only options available in System V @code{ptx} are options @samp{-b},
|
|
@samp{-f}, @samp{-g}, @samp{-i}, @samp{-o}, @samp{-r}, @samp{-t} and
|
|
@samp{-w}. All other options are GNU extensions and are not repeated in
|
|
this enumeration. Moreover, some options have a slightly different
|
|
meaning when GNU extensions are enabled, as explained below.
|
|
|
|
@item
|
|
By default, concordance output is not formatted for @code{troff} or
|
|
@code{nroff}. It is rather formatted for a dumb terminal. @code{troff}
|
|
or @code{nroff} output may still be selected through option @code{-O}.
|
|
|
|
@item
|
|
Unless @code{-R} option is used, the maximum reference width is
|
|
subtracted from the total output line width. With GNU extensions
|
|
disabled, width of references is not taken into account in the output
|
|
line width computations.
|
|
|
|
@item
|
|
All 256 characters, even @kbd{NUL}s, are always read and processed from
|
|
input file with no adverse effect, even if GNU extensions are disabled.
|
|
However, System V @code{ptx} does not accept 8-bit characters, a few
|
|
control characters are rejected, and the tilde @kbd{~} is condemned.
|
|
|
|
@item
|
|
Input line length is only limited by available memory, even if GNU
|
|
extensions are disabled. However, System V @code{ptx} processes only
|
|
the first 200 characters in each line.
|
|
|
|
@item
|
|
The break (non-word) characters default to every character except all
|
|
letters of the underlying character set, diacriticized or not. When GNU
|
|
extensions are disabled, the break characters default to space, tab and
|
|
newline only.
|
|
|
|
@item
|
|
The program makes better use of output line width. If GNU extensions
|
|
are disabled, the program rather tries to imitate System V @code{ptx},
|
|
but still, there are some slight disposition glitches this program does
|
|
not completely reproduce.
|
|
|
|
@item
|
|
The user can specify both an Ignore file and an Only file. This is not
|
|
allowed with System V @code{ptx}.
|
|
|
|
@end itemize
|
|
|
|
|
|
@node Operating on fields within a line
|
|
@chapter Operating on fields within a line
|
|
|
|
@menu
|
|
* cut invocation:: Print selected parts of lines.
|
|
* paste invocation:: Merge lines of files.
|
|
* join invocation:: Join lines on a common field.
|
|
@end menu
|
|
|
|
|
|
@node cut invocation
|
|
@section @code{cut}: Print selected parts of lines
|
|
|
|
@pindex cut
|
|
@code{cut} writes to standard output selected parts of each line of each
|
|
input file, or standard input if no files are given or for a file name of
|
|
@samp{-}. Synopsis:
|
|
|
|
@example
|
|
cut [@var{option}]@dots{} [@var{file}]@dots{}
|
|
@end example
|
|
|
|
In the table which follows, the @var{byte-list}, @var{character-list},
|
|
and @var{field-list} are one or more numbers or ranges (two numbers
|
|
separated by a dash) separated by commas. Bytes, characters, and
|
|
fields are numbered starting at 1. Incomplete ranges may be
|
|
given: @samp{-@var{m}} means @samp{1-@var{m}}; @samp{@var{n}-} means
|
|
@samp{@var{n}} through end of line or last field.
|
|
|
|
The program accepts the following options. Also see @ref{Common
|
|
options}.
|
|
|
|
@table @samp
|
|
|
|
@item -b @var{byte-list}
|
|
@itemx --bytes=@var{byte-list}
|
|
@opindex -b
|
|
@opindex --bytes
|
|
Print only the bytes in positions listed in @var{byte-list}. Tabs and
|
|
backspaces are treated like any other character; they take up 1 byte.
|
|
|
|
@item -c @var{character-list}
|
|
@itemx --characters=@var{character-list}
|
|
@opindex -c
|
|
@opindex --characters
|
|
Print only characters in positions listed in @var{character-list}.
|
|
The same as @samp{-b} for now, but internationalization will change
|
|
that. Tabs and backspaces are treated like any other character; they
|
|
take up 1 character.
|
|
|
|
@item -f @var{field-list}
|
|
@itemx --fields=@var{field-list}
|
|
@opindex -f
|
|
@opindex --fields
|
|
Print only the fields listed in @var{field-list}. Fields are
|
|
separated by a @key{TAB} by default.
|
|
|
|
@item -d @var{delim}
|
|
@itemx --delimiter=@var{delim}
|
|
@opindex -d
|
|
@opindex --delimiter
|
|
For @samp{-f}, fields are separated by the first character in @var{delim}
|
|
(default is @key{TAB}).
|
|
|
|
@item -n
|
|
@opindex -n
|
|
Do not split multi-byte characters (no-op for now).
|
|
|
|
@item -s
|
|
@itemx --only-delimited
|
|
@opindex -s
|
|
@opindex --only-delimited
|
|
For @samp{-f}, do not print lines that do not contain the field separator
|
|
character.
|
|
|
|
@end table
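The list syntax and the @samp{-d} and @samp{-s} options can be tried on
small inputs. The following is a sketch with invented sample data; the
results noted in the comments are what the descriptions above lead one
to expect.

```shell
# Fields 1 and 3 of colon-separated records.
fields=$(printf 'alice:x:1000\nbob:x:1001\n' | cut -d: -f1,3)
# alice:1000
# bob:1001

# Incomplete ranges: -3 means 1-3, and 4- means 4 through end of line.
head3=$(printf 'abcdef\n' | cut -c-3)          # abc
rest=$(printf 'abcdef\n' | cut -c4-)           # def

# Without -s, a line that has no delimiter is printed whole;
# with -s, such a line is dropped.
loose=$(printf 'a\tb\nplain\n' | cut -f2)      # b, then plain
strict=$(printf 'a\tb\nplain\n' | cut -s -f2)  # b

printf '%s\n' "$fields" "$head3" "$rest" "$loose" "$strict"
```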
|
|
|
|
|
|
@node paste invocation
|
|
@section @code{paste}: Merge lines of files
|
|
|
|
@pindex paste
|
|
@cindex merging files
|
|
|
|
@code{paste} writes to standard output lines consisting of sequentially
|
|
corresponding lines of each given file, separated by @key{TAB}.
|
|
Standard input is used for a file name of @samp{-} or if no input files
|
|
are given.
|
|
|
|
Synopsis:
|
|
|
|
@example
|
|
paste [@var{option}]@dots{} [@var{file}]@dots{}
|
|
@end example
|
|
|
|
The program accepts the following options. Also see @ref{Common options}.
|
|
|
|
@table @samp
|
|
|
|
@item -s
|
|
@itemx --serial
|
|
@opindex -s
|
|
@opindex --serial
|
|
Paste the lines of one file at a time rather than one line from each
|
|
file.
|
|
|
|
@item -d @var{delim-list}
|
|
@itemx --delimiters=@var{delim-list}
|
|
@opindex -d
|
|
@opindex --delimiters
|
|
Consecutively use the characters in @var{delim-list} instead of
|
|
@key{TAB} to separate merged lines. When @var{delim-list} is
|
|
exhausted, start again at its beginning.
|
|
|
|
@end table
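A minimal sketch of the default, serial, and delimiter-list behaviors
follows. The file names under the temporary directory and their
contents are made up for the illustration.

```shell
tmp=$(mktemp -d)
printf '1\n2\n3\n' > "$tmp/nums"
printf 'a\nb\nc\n' > "$tmp/letters"

# Default: corresponding lines side by side, separated by a tab.
merged=$(paste "$tmp/nums" "$tmp/letters")

# -s: all lines of one file on a single output line; -d picks the separator.
serial=$(paste -s -d, "$tmp/nums")                     # 1,2,3

# A delimiter list is used cyclically: comma, semicolon, comma, ...
cycled=$(printf '1\n2\n3\n4\n' | paste -s -d ',;' -)   # 1,2;3,4

rm -rf "$tmp"
printf '%s\n' "$merged" "$serial" "$cycled"
```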
|
|
|
|
|
|
@node join invocation
|
|
@section @code{join}: Join lines on a common field
|
|
|
|
@pindex join
|
|
@cindex common field, joining on
|
|
|
|
@code{join} writes to standard output a line for each pair of input
|
|
lines that have identical join fields. Synopsis:
|
|
|
|
@example
|
|
join [@var{option}]@dots{} @var{file1} @var{file2}
|
|
@end example
|
|
|
|
Either @var{file1} or @var{file2} (but not both) can be @samp{-},
|
|
meaning standard input. @var{file1} and @var{file2} should be already
|
|
sorted in increasing order (not numerically) on the join fields; unless
|
|
the @samp{-t} option is given, they should be sorted ignoring blanks at
|
|
the start of the join field, as in @code{sort -b}. If the
|
|
@samp{--ignore-case} option is given, lines should be sorted without
|
|
regard to the case of characters in the join field, as in @code{sort -f}.
|
|
|
|
The defaults are: the join field is the first field in each line;
|
|
fields in the input are separated by one or more blanks, with leading
|
|
blanks on the line ignored; fields in the output are separated by a
|
|
space; each output line consists of the join field, the remaining
|
|
fields from @var{file1}, then the remaining fields from @var{file2}.
|
|
|
|
The program accepts the following options. Also see @ref{Common options}.
|
|
|
|
@table @samp
|
|
|
|
@item -a @var{file-number}
|
|
@opindex -a
|
|
Print a line for each unpairable line in file @var{file-number} (either
|
|
@samp{1} or @samp{2}), in addition to the normal output.
|
|
|
|
@item -e @var{string}
|
|
@opindex -e
|
|
Replace those output fields that are missing in the input with
|
|
@var{string}.
|
|
|
|
@item -i
|
|
@itemx --ignore-case
|
|
@opindex -i
|
|
@opindex --ignore-case
|
|
Ignore differences in case when comparing keys.
|
|
With this option, the lines of the input files must be ordered in the same way.
|
|
Use @samp{sort -f} to produce this ordering.
|
|
|
|
@item -1 @var{field}
|
|
@itemx -j1 @var{field}
|
|
@opindex -1
|
|
@opindex -j1
|
|
Join on field @var{field} (a positive integer) of file 1.
|
|
|
|
@item -2 @var{field}
|
|
@itemx -j2 @var{field}
|
|
@opindex -2
|
|
@opindex -j2
|
|
Join on field @var{field} (a positive integer) of file 2.
|
|
|
|
@item -j @var{field}
|
|
Equivalent to @samp{-1 @var{field} -2 @var{field}}.
|
|
|
|
@item -o @var{field-list}@dots{}
|
|
Construct each output line according to the format in @var{field-list}.
|
|
Each element in @var{field-list} is either the single character @samp{0} or
|
|
has the form @var{m.n} where the file number, @var{m}, is @samp{1} or
|
|
@samp{2} and @var{n} is a positive field number.
|
|
|
|
A field specification of @samp{0} denotes the join field.
|
|
In most cases, the functionality of the @samp{0} field spec
|
|
may be reproduced using the explicit @var{m.n} that corresponds
|
|
to the join field. However, when printing unpairable lines
|
|
(using either of the @samp{-a} or @samp{-v} options), there is no way
|
|
to specify the join field using @var{m.n} in @var{field-list}
|
|
if there are unpairable lines in both files.
|
|
To give @code{join} that functionality, @sc{POSIX} invented the @samp{0}
|
|
field specification notation.
|
|
|
|
The elements in @var{field-list}
|
|
are separated by commas or blanks. Multiple @var{field-list}
|
|
arguments can be given after a single @samp{-o} option; the values
|
|
of all lists given with @samp{-o} are concatenated together.
|
|
All output lines -- including those printed because of any @samp{-a} or @samp{-v}
|
|
option -- are subject to the specified @var{field-list}.
|
|
|
|
@item -t @var{char}
|
|
Use character @var{char} as the input and output field separator.
|
|
|
|
@item -v @var{file-number}
|
|
Print a line for each unpairable line in file @var{file-number}
|
|
(either @samp{1} or @samp{2}), instead of the normal output.
|
|
|
|
@end table
|
|
|
|
In addition, when GNU @code{join} is invoked with exactly one argument,
|
|
options @samp{--help} and @samp{--version} are recognized. @xref{Common
|
|
options}.
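
The options above can be combined as in the following sketch. The two
small files are invented for the illustration and are already sorted on
their first field; the comments show the lines each command is expected
to produce.

```shell
tmp=$(mktemp -d)
printf '1 apple\n2 banana\n3 cherry\n' > "$tmp/fruit"
printf '1 red\n3 crimson\n4 green\n'   > "$tmp/color"

# Default: one line per pair of lines with the same join field.
paired=$(join "$tmp/fruit" "$tmp/color")
# 1 apple red
# 3 cherry crimson

# Also keep unpairable lines of file 1; -o fixes the output layout
# (0 is the join field) and -e fills in the missing fields.
outer=$(join -a 1 -e NONE -o 0,1.2,2.2 "$tmp/fruit" "$tmp/color")
# 1 apple red
# 2 banana NONE
# 3 cherry crimson

# Only the unpairable lines of file 2.
only2=$(join -v 2 "$tmp/fruit" "$tmp/color")
# 4 green

rm -rf "$tmp"
printf '%s\n' "$paired" "$outer" "$only2"
```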
|
|
|
|
|
|
@node Operating on characters
|
|
@chapter Operating on characters
|
|
|
|
@cindex operating on characters
|
|
|
|
These commands operate on individual characters.
|
|
|
|
@menu
|
|
* tr invocation:: Translate, squeeze, and/or delete characters.
|
|
* expand invocation:: Convert tabs to spaces.
|
|
* unexpand invocation:: Convert spaces to tabs.
|
|
@end menu
|
|
|
|
|
|
@node tr invocation
|
|
@section @code{tr}: Translate, squeeze, and/or delete characters
|
|
|
|
@pindex tr
|
|
|
|
Synopsis:
|
|
|
|
@example
|
|
tr [@var{option}]@dots{} @var{set1} [@var{set2}]
|
|
@end example
|
|
|
|
@code{tr} copies standard input to standard output, performing
|
|
one of the following operations:
|
|
|
|
@itemize @bullet
|
|
@item
|
|
translate, and optionally squeeze repeated characters in the result,
|
|
@item
|
|
squeeze repeated characters,
|
|
@item
|
|
delete characters,
|
|
@item
|
|
delete characters, then squeeze repeated characters from the result.
|
|
@end itemize
|
|
|
|
The @var{set1} and (if given) @var{set2} arguments define ordered
|
|
sets of characters, referred to below as @var{set1} and @var{set2}. These
|
|
sets are the characters of the input that @code{tr} operates on.
|
|
The @samp{--complement} (@samp{-c}) option replaces @var{set1} with its
|
|
complement (all of the characters that are not in @var{set1}).
|
|
|
|
@menu
|
|
* Character sets:: Specifying sets of characters.
|
|
* Translating:: Changing one character to another.
|
|
* Squeezing:: Squeezing repeats and deleting.
|
|
* Warnings in tr:: Warning messages.
|
|
@end menu
|
|
|
|
|
|
@node Character sets
|
|
@subsection Specifying sets of characters
|
|
|
|
@cindex specifying sets of characters
|
|
|
|
The format of the @var{set1} and @var{set2} arguments resembles
|
|
the format of regular expressions; however, they are not regular
|
|
expressions, only lists of characters. Most characters simply
|
|
represent themselves in these strings, but the strings can contain
|
|
the shorthands listed below, for convenience. Some of them can be
|
|
used only in @var{set1} or @var{set2}, as noted below.
|
|
|
|
@table @asis
|
|
|
|
@item Backslash escapes
|
|
@cindex backslash escapes
|
|
|
|
A backslash followed by a character not listed below causes an error
|
|
message.
|
|
|
|
@table @samp
|
|
@item \a
|
|
Control-G.
|
|
@item \b
|
|
Control-H.
|
|
@item \f
|
|
Control-L.
|
|
@item \n
|
|
Control-J.
|
|
@item \r
|
|
Control-M.
|
|
@item \t
|
|
Control-I.
|
|
@item \v
|
|
Control-K.
|
|
@item \@var{ooo}
|
|
The character with the value given by @var{ooo}, which is 1 to 3
|
|
octal digits.
|
|
@item \\
|
|
A backslash.
|
|
@end table
|
|
|
|
@item Ranges
|
|
@cindex ranges
|
|
|
|
The notation @samp{@var{m}-@var{n}} expands to all of the characters
|
|
from @var{m} through @var{n}, in ascending order. @var{m} should
|
|
collate before @var{n}; if it doesn't, an error results. As an example,
|
|
@samp{0-9} is the same as @samp{0123456789}. Although GNU @code{tr}
|
|
does not support the System V syntax that uses square brackets to
|
|
enclose ranges, translations specified in that format will still work as
|
|
long as the brackets in @var{set1} correspond to identical brackets
in @var{set2}.
|
|
|
|
@item Repeated characters
|
|
@cindex repeated characters
|
|
|
|
The notation @samp{[@var{c}*@var{n}]} in @var{set2} expands to @var{n}
|
|
copies of character @var{c}. Thus, @samp{[y*6]} is the same as
|
|
@samp{yyyyyy}. The notation @samp{[@var{c}*]} in @var{set2} expands
|
|
to as many copies of @var{c} as are needed to make @var{set2} as long as
|
|
@var{set1}. If @var{n} begins with @samp{0}, it is interpreted in
|
|
octal, otherwise in decimal.
|
|
|
|
@item Character classes
|
|
@cindex character classes
|
|
|
|
The notation @samp{[:@var{class}:]} expands to all of the characters in
|
|
the (predefined) class @var{class}. The characters expand in no
|
|
particular order, except for the @code{upper} and @code{lower} classes,
|
|
which expand in ascending order. When the @samp{--delete} (@samp{-d})
|
|
and @samp{--squeeze-repeats} (@samp{-s}) options are both given, any
|
|
character class can be used in @var{set2}. Otherwise, only the
|
|
character classes @code{lower} and @code{upper} are accepted in
|
|
@var{set2}, and then only if the corresponding character class
|
|
(@code{upper} and @code{lower}, respectively) is specified in the same
|
|
relative position in @var{set1}. Doing this specifies case conversion.
|
|
The class names are given below; an error results when an invalid class
|
|
name is given.
|
|
|
|
@table @code
|
|
@item alnum
|
|
@opindex alnum
|
|
Letters and digits.
|
|
@item alpha
|
|
@opindex alpha
|
|
Letters.
|
|
@item blank
|
|
@opindex blank
|
|
Horizontal whitespace.
|
|
@item cntrl
|
|
@opindex cntrl
|
|
Control characters.
|
|
@item digit
|
|
@opindex digit
|
|
Digits.
|
|
@item graph
|
|
@opindex graph
|
|
Printable characters, not including space.
|
|
@item lower
|
|
@opindex lower
|
|
Lowercase letters.
|
|
@item print
|
|
@opindex print
|
|
Printable characters, including space.
|
|
@item punct
|
|
@opindex punct
|
|
Punctuation characters.
|
|
@item space
|
|
@opindex space
|
|
Horizontal or vertical whitespace.
|
|
@item upper
|
|
@opindex upper
|
|
Uppercase letters.
|
|
@item xdigit
|
|
@opindex xdigit
|
|
Hexadecimal digits.
|
|
@end table
|
|
|
|
@item Equivalence classes
|
|
@cindex equivalence classes
|
|
|
|
The syntax @samp{[=@var{c}=]} expands to all of the characters that are
|
|
equivalent to @var{c}, in no particular order. Equivalence classes are
|
|
a relatively recent invention intended to support non-English alphabets.
|
|
But there seems to be no standard way to define them or determine their
|
|
contents. Therefore, they are not fully implemented in GNU @code{tr};
|
|
each character's equivalence class consists only of that character,
|
|
which is of no particular use.
|
|
|
|
@end table
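The shorthands above can be seen at work in a few one-line commands.
This is only a sketch, assuming GNU @code{tr} and invented sample
strings; the comments give the expected results.

```shell
# Range: delete the lowercase letters.
no_lower=$(printf 'a1b22c333\n' | tr -d 'a-z')          # 122333

# Character class: delete the digits.
no_digit=$(printf 'a1b22c333\n' | tr -d '[:digit:]')    # abc

# Repeat: [x*] pads set2 with x to the length of set1.
masked=$(printf 'hello\n' | tr 'a-z' '[x*]')            # xxxxx

# Octal escape: \040 is a space.
joined=$(printf 'a b\n' | tr '\040' '_')                # a_b

printf '%s\n' "$no_lower" "$no_digit" "$masked" "$joined"
```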
|
|
|
|
|
|
@node Translating
|
|
@subsection Translating
|
|
|
|
@cindex translating characters
|
|
|
|
@code{tr} performs translation when @var{set1} and @var{set2} are
|
|
both given and the @samp{--delete} (@samp{-d}) option is not given.
|
|
@code{tr} translates each character of its input that is in @var{set1}
|
|
to the corresponding character in @var{set2}. Characters not in
|
|
@var{set1} are passed through unchanged. When a character appears more
|
|
than once in @var{set1} and the corresponding characters in @var{set2}
|
|
are not all the same, only the final one is used. For example, these
|
|
two commands are equivalent:
|
|
|
|
@example
|
|
tr aaa xyz
|
|
tr a z
|
|
@end example
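A quick check of that equivalence with GNU @code{tr}, as a sketch on an
invented sample word:

```shell
# The last mapping for a repeated set1 character wins, so a becomes z.
first=$(printf 'banana\n' | tr aaa xyz)
second=$(printf 'banana\n' | tr a z)

printf '%s\n%s\n' "$first" "$second"    # both lines read bznznz
```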
|
|
|
|
A common use of @code{tr} is to convert lowercase characters to
|
|
uppercase. This can be done in many ways. Here are three of them:
|
|
|
|
@example
|
|
tr abcdefghijklmnopqrstuvwxyz ABCDEFGHIJKLMNOPQRSTUVWXYZ
|
|
tr a-z A-Z
|
|
tr '[:lower:]' '[:upper:]'
|
|
@end example
|
|
|
|
When @code{tr} is performing translation, @var{set1} and @var{set2}
|
|
typically have the same length. If @var{set1} is shorter than
|
|
@var{set2}, the extra characters at the end of @var{set2} are ignored.
|
|
|
|
On the other hand, making @var{set1} longer than @var{set2} is not
|
|
portable; @sc{POSIX.2} says that the result is undefined. In this situation,
|
|
BSD @code{tr} pads @var{set2} to the length of @var{set1} by repeating
|
|
the last character of @var{set2} as many times as necessary. System V
|
|
@code{tr} truncates @var{set1} to the length of @var{set2}.
|
|
|
|
By default, GNU @code{tr} handles this case like BSD @code{tr}. When
|
|
the @samp{--truncate-set1} (@samp{-t}) option is given, GNU @code{tr}
|
|
handles this case like the System V @code{tr} instead. This option is
|
|
ignored for operations other than translation.
|
|
|
|
Acting like System V @code{tr} in this case breaks the relatively common
|
|
BSD idiom:
|
|
|
|
@example
|
|
tr -cs A-Za-z0-9 '\012'
|
|
@end example
|
|
|
|
@noindent
|
|
because it converts only zero bytes (the first element in the
|
|
complement of @var{set1}), rather than all non-alphanumerics, to
|
|
newlines.
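
The difference between the two behaviors is easy to observe with a
@var{set1} that is longer than @var{set2}. The following sketch assumes
GNU @code{tr}; the sample strings are invented.

```shell
# Default (BSD style): set2 is padded with its last character,
# so c and d both map to y.
padded=$(printf 'abcd\n' | tr abcd xy)          # xyyy

# With -t (System V style): set1 is cut down to "ab",
# so c and d pass through unchanged.
truncated=$(printf 'abcd\n' | tr -t abcd xy)    # xycd

# The BSD idiom relies on the padding: every non-alphanumeric
# character becomes a newline, and runs of newlines are squeezed.
words=$(printf 'one, two;three\n' | tr -cs A-Za-z0-9 '\012')

printf '%s\n' "$padded" "$truncated" "$words"
```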
|
|
|
|
|
|
@node Squeezing
|
|
@subsection Squeezing repeats and deleting
|
|
|
|
@cindex squeezing repeat characters
|
|
@cindex deleting characters
|
|
|
|
When given just the @samp{--delete} (@samp{-d}) option, @code{tr}
|
|
removes any input characters that are in @var{set1}.
|
|
|
|
When given just the @samp{--squeeze-repeats} (@samp{-s}) option,
|
|
@code{tr} replaces each input sequence of a repeated character that
|
|
is in @var{set1} with a single occurrence of that character.
|
|
|
|
When given both @samp{--delete} and @samp{--squeeze-repeats}, @code{tr}
|
|
first performs any deletions using @var{set1}, then squeezes repeats
|
|
from any remaining characters using @var{set2}.
|
|
|
|
The @samp{--squeeze-repeats} option may also be used when translating,
|
|
in which case @code{tr} first performs translation, then squeezes
|
|
repeats from any remaining characters using @var{set2}.
|
|
|
|
Here are some examples to illustrate various combinations of options:
|
|
|
|
@itemize @bullet
|
|
|
|
@item
|
|
Remove all zero bytes:
|
|
|
|
@example
|
|
tr -d '\000'
|
|
@end example
|
|
|
|
@item
|
|
Put all words on lines by themselves. This converts all
|
|
non-alphanumeric characters to newlines, then squeezes each string
|
|
of repeated newlines into a single newline:
|
|
|
|
@example
|
|
tr -cs '[a-zA-Z0-9]' '[\n*]'
|
|
@end example
|
|
|
|
@item
|
|
Convert each sequence of repeated newlines to a single newline:
|
|
|
|
@example
|
|
tr -s '\n'
|
|
@end example
|
|
|
|
@item
|
|
Find doubled occurrences of words in a document.
|
|
For example, people often write ``the the'' with the duplicated words
|
|
separated by a newline. The Bourne shell script below works first
|
|
by converting each sequence of punctuation and blank characters to a
|
|
single newline. That puts each ``word'' on a line by itself.
|
|
Next it maps all uppercase characters to lower case, and finally it
|
|
runs @code{uniq} with the @samp{-d} option to print out only the words
|
|
that were adjacent duplicates.
|
|
|
|
@example
|
|
#!/bin/sh
|
|
cat "$@@" \
|
|
| tr -s '[:punct:][:blank:]' '\n' \
|
|
| tr '[:upper:]' '[:lower:]' \
|
|
| uniq -d
|
|
@end example
|
|
|
|
@end itemize
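The four modes described in this section can be compared on small
inputs. This is a sketch with invented sample strings; the comments
give the results the descriptions above predict.

```shell
# Delete only: remove every character that is in set1.
deleted=$(printf 'a_b_c\n' | tr -d '_')               # abc

# Squeeze only: runs of a or b collapse; c is not in set1.
squeezed=$(printf 'aaabbbccc\n' | tr -s 'ab')         # abccc

# Delete, then squeeze: digits (set1) are removed first,
# then repeated letters (set2) are squeezed.
both=$(printf 'aa11bb22\n' | tr -ds '0-9' 'a-z')      # ab

# Translate, then squeeze: spaces become newlines,
# and the repeated newlines collapse into one.
split=$(printf 'hello   world\n' | tr -s ' ' '\n')    # hello, then world

printf '%s\n' "$deleted" "$squeezed" "$both" "$split"
```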
|
|
|
|
|
|
@node Warnings in tr
|
|
@subsection Warning messages
|
|
|
|
@vindex POSIXLY_CORRECT
|
|
Setting the environment variable @code{POSIXLY_CORRECT} turns off the
|
|
following warning and error messages, for strict compliance with
|
|
@sc{POSIX.2}. Otherwise, the following diagnostics are issued:
|
|
|
|
@enumerate
|
|
|
|
@item
|
|
When the @samp{--delete} option is given but @samp{--squeeze-repeats}
|
|
is not, and @var{set2} is given, GNU @code{tr} by default prints
|
|
a usage message and exits, because @var{set2} would not be used.
|
|
The @sc{POSIX} specification says that @var{set2} must be ignored in
|
|
this case. Silently ignoring arguments is a bad idea.
|
|
|
|
@item
|
|
When an ambiguous octal escape is given, a warning is issued. For example, @samp{\400}
|
|
is actually @samp{\40} followed by the digit @samp{0}, because the
|
|
value 400 octal does not fit into a single byte.
|
|
|
|
@end enumerate
|
|
|
|
GNU @code{tr} does not provide complete BSD or System V compatibility.
|
|
For example, it is impossible to disable interpretation of the @sc{POSIX}
|
|
constructs @samp{[:alpha:]}, @samp{[=c=]}, and @samp{[c*10]}. Also, GNU
|
|
@code{tr} does not delete zero bytes automatically, unlike traditional
|
|
Unix versions, which provide no way to preserve zero bytes.
|
|
|
|
|
|
@node expand invocation
|
|
@section @code{expand}: Convert tabs to spaces
|
|
|
|
@pindex expand
|
|
@cindex tabs to spaces, converting
|
|
@cindex converting tabs to spaces
|
|
|
|
@code{expand} writes the contents of each given @var{file}, or standard
|
|
input if none are given or for a @var{file} of @samp{-}, to standard
|
|
output, with tab characters converted to the appropriate number of
|
|
spaces. Synopsis:
|
|
|
|
@example
|
|
expand [@var{option}]@dots{} [@var{file}]@dots{}
|
|
@end example
|
|
|
|
By default, @code{expand} converts all tabs to spaces. It preserves
|
|
backspace characters in the output; they decrement the column count for
|
|
tab calculations. The default action is equivalent to @samp{-8} (set
|
|
tabs every 8 columns).
|
|
|
|
The program accepts the following options. Also see @ref{Common options}.
|
|
|
|
@table @samp
|
|
|
|
@item -@var{tab1}[,@var{tab2}]@dots{}
|
|
@itemx -t @var{tab1}[,@var{tab2}]@dots{}
|
|
@itemx --tabs=@var{tab1}[,@var{tab2}]@dots{}
|
|
@opindex -@var{tab}
|
|
@opindex -t
|
|
@opindex --tabs
|
|
@cindex tabstops, setting
|
|
If only one tab stop is given, set the tabs @var{tab1} spaces apart
|
|
(default is 8). Otherwise, set the tabs at columns @var{tab1},
|
|
@var{tab2}, @dots{} (numbered from 0), and replace any tabs beyond the
|
|
last tabstop given with single spaces. If the tabstops are specified
|
|
with the @samp{-t} or @samp{--tabs} option, they can be separated by
|
|
blanks as well as by commas.
|
|
|
|
@item -i
|
|
@itemx --initial
|
|
@opindex -i
|
|
@opindex --initial
|
|
@cindex initial tabs, converting
|
|
Only convert initial tabs (those that precede all non-space or non-tab
|
|
characters) on each line to spaces.
|
|
|
|
@end table
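The effect of the tab stop settings can be checked by piping short
strings through @code{expand}. The inputs below are invented for this
sketch, and the comments describe the expected output.

```shell
# Default tab stops every 8 columns: b lands in column 8.
wide=$(printf 'a\tb\n' | expand)                  # a, 7 spaces, b

# One number: tab stops every 4 columns.
narrow=$(printf 'a\tb\n' | expand -t 4)           # a, 3 spaces, b

# A list: stops at columns 2 and 6; the tab after c lies beyond
# the last stop and becomes a single space.
stops=$(printf 'a\tb\tc\td\n' | expand -t 2,6)    # "a b   c d"

# -i: only the leading tab is converted; the tab after x is kept.
initial=$(printf '\tx\ty\n' | expand -i -t 2)     # 2 spaces, x, tab, y

printf '%s\n' "$wide" "$narrow" "$stops" "$initial"
```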
|
|
|
|
|
|
@node unexpand invocation
|
|
@section @code{unexpand}: Convert spaces to tabs
|
|
|
|
@pindex unexpand
|
|
|
|
@code{unexpand} writes the contents of each given @var{file}, or
|
|
standard input if none are given or for a @var{file} of @samp{-}, to
|
|
standard output, with strings of two or more space or tab characters
|
|
converted to as many tabs as possible followed by as many spaces as are
|
|
needed. Synopsis:
|
|
|
|
@example
|
|
unexpand [@var{option}]@dots{} [@var{file}]@dots{}
|
|
@end example
|
|
|
|
By default, @code{unexpand} converts only initial spaces and tabs (those
|
|
that precede all non space or tab characters) on each line. It
|
|
preserves backspace characters in the output; they decrement the column
|
|
count for tab calculations. By default, tabs are set at every 8th
|
|
column.
|
|
|
|
The program accepts the following options. Also see @ref{Common options}.
|
|
|
|
@table @samp
|
|
|
|
@item -@var{tab1}[,@var{tab2}]@dots{}
|
|
@itemx -t @var{tab1}[,@var{tab2}]@dots{}
|
|
@itemx --tabs=@var{tab1}[,@var{tab2}]@dots{}
|
|
@opindex -@var{tab}
|
|
@opindex -t
|
|
@opindex --tabs
|
|
If only one tab stop is given, set the tabs @var{tab1} spaces apart
|
|
instead of the default 8. Otherwise, set the tabs at columns
|
|
@var{tab1}, @var{tab2}, @dots{} (numbered from 0), and leave spaces and
|
|
tabs beyond the tabstops given unchanged. If the tabstops are specified
|
|
with the @samp{-t} or @samp{--tabs} option, they can be separated by
|
|
blanks as well as by commas. This option implies the @samp{-a} option.
|
|
|
|
@item -a
|
|
@itemx --all
|
|
@opindex -a
|
|
@opindex --all
|
|
Convert all strings of two or more spaces or tabs, not just initial
|
|
ones, to tabs.
|
|
|
|
@end table
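
The difference between the default behavior and @samp{-a} can be
sketched as follows. The sample line is an assumption made up for this
example: eight spaces, @samp{a}, seven spaces, then @samp{b}, so that
@samp{b} lands exactly on the tab stop at column 16.

```shell
# Build the sample line: 8 spaces, "a", 7 spaces, "b".
line=$(printf '%8sa%7sb' '' '')

default=$(printf '%s\n' "$line" | unexpand)   # only the leading run is converted
all=$(printf '%s\n' "$line" | unexpand -a)    # the run before "b" is converted too

# Show tabs as ">" so the difference is visible.
printf '%s\n' "$default" | tr '\t' '>'
printf '%s\n' "$all" | tr '\t' '>'
```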
|
|
|
|
@c What's GNU?
|
|
@c Arnold Robbins
|
|
@node Opening the software toolbox
|
|
@chapter Opening the software toolbox
|
|
|
|
This chapter originally appeared in @cite{Linux Journal}, volume 1,
|
|
number 2, in the @cite{What's GNU?} column. It was written by Arnold
|
|
Robbins.
|
|
|
|
@menu
|
|
* Toolbox introduction:: Toolbox introduction
|
|
* I/O redirection:: I/O redirection
|
|
* The who command:: The @code{who} command
|
|
* The cut command:: The @code{cut} command
|
|
* The sort command:: The @code{sort} command
|
|
* The uniq command:: The @code{uniq} command
|
|
* Putting the tools together:: Putting the tools together
|
|
@end menu
|
|
|
|
|
|
@node Toolbox introduction
|
|
@unnumberedsec Toolbox introduction
|
|
|
|
This month's column is only peripherally related to the GNU Project, in
|
|
that it describes a number of the GNU tools on your Linux system and how they
|
|
might be used. What it's really about is the ``Software Tools'' philosophy
|
|
of program development and usage.
|
|
|
|
The software tools philosophy was an important and integral concept
|
|
in the initial design and development of Unix (of which Linux and GNU are
|
|
essentially clones). Unfortunately, in the modern-day press of
|
|
Internetworking and flashy GUIs, it seems to have fallen by the
|
|
wayside. This is a shame, since it provides a powerful mental model
|
|
for solving many kinds of problems.
|
|
|
|
Many people carry a Swiss Army knife around in their pants pockets (or
|
|
purse). A Swiss Army knife is a handy tool to have: it has several knife
|
|
blades, a screwdriver, tweezers, toothpick, nail file, corkscrew, and perhaps
|
|
a number of other things on it. For the everyday, small miscellaneous jobs
|
|
where you need a simple, general purpose tool, it's just the thing.
|
|
|
|
On the other hand, an experienced carpenter doesn't build a house using
|
|
a Swiss Army knife. Instead, he has a toolbox chock full of specialized
|
|
tools---a saw, a hammer, a screwdriver, a plane, and so on. And he knows
|
|
exactly when and where to use each tool; you won't catch him hammering nails
|
|
with the handle of his screwdriver.
|
|
|
|
The Unix developers at Bell Labs were all professional programmers and trained
|
|
computer scientists. They had found that while a one-size-fits-all program
|
|
might appeal to a user because there's only one program to use, in practice
|
|
such programs are
|
|
|
|
@enumerate a
|
|
@item
|
|
difficult to write,
|
|
|
|
@item
|
|
difficult to maintain and
|
|
debug, and
|
|
|
|
@item
|
|
difficult to extend to meet new situations.
|
|
@end enumerate
|
|
|
|
Instead, they felt that programs should be specialized tools. In short, each
|
|
program ``should do one thing well.'' No more and no less. Such programs are
|
|
simpler to design, write, and get right---they only do one thing.
|
|
|
|
Furthermore, they found that with the right machinery for hooking programs
|
|
together, the whole was greater than the sum of the parts. By combining
|
|
several special purpose programs, you could accomplish a specific task
|
|
that none of the programs was designed for, and accomplish it much more
|
|
quickly and easily than if you had to write a special purpose program.
|
|
We will see some (classic) examples of this further on in the column.
|
|
(An important additional point was that, if you don't already have
something appropriate in the toolbox, you should take a detour and build
the software tools you need first.)
|
|
|
|
@node I/O redirection
|
|
@unnumberedsec I/O redirection
|
|
|
|
Hopefully, you are familiar with the basics of I/O redirection in the
|
|
shell, in particular the concepts of ``standard input,'' ``standard output,''
|
|
and ``standard error''. Briefly, ``standard input'' is a data source, where
|
|
data comes from. A program should not need to either know or care if the
|
|
data source is a disk file, a keyboard, a magnetic tape, or even a punched
|
|
card reader. Similarly, ``standard output'' is a data sink, where data goes
|
|
to. The program should neither know nor care where this might be.
|
|
Programs that only read their standard input, do something to the data,
|
|
and then send it on, are called ``filters'', by analogy to filters in a
|
|
water pipeline.
|
|
|
|
With the Unix shell, it's very easy to set up data pipelines:
|
|
|
|
@example
|
|
program_to_create_data | filter1 | .... | filterN > final.pretty.data
|
|
@end example
|
|
|
|
We start out by creating the raw data; each filter applies some successive
|
|
transformation to the data, until by the time it comes out of the pipeline,
|
|
it is in the desired form.
|
|
|
|
This is fine and good for standard input and standard output. Where does the
|
|
standard error come into play? Well, think about @code{filter1} in
|
|
the pipeline above. What happens if it encounters an error in the data it
|
|
sees? If it writes an error message to standard output, it will just
|
|
disappear down the pipeline into @code{filter2}'s input, and the
|
|
user will probably never see it. So programs need a place where they can send
|
|
error messages so that the user will notice them. This is standard error,
|
|
and it is usually connected to your console or window, even if you have
|
|
redirected standard output of your program away from your screen.
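
The separation of the two output streams can be sketched in the shell
itself. The @code{filter1} function below is a hypothetical stand-in,
not a real program; the record format (lines beginning with @samp{bad}
count as errors) is likewise an assumption made for this example.

```shell
# A hypothetical stand-in for filter1: it passes good records along and
# reports bad ones on standard error (file descriptor 2).
filter1() (
    while read -r line; do
        case $line in
            bad*) echo "filter1: cannot parse '$line'" >&2 ;;
            *)    echo "$line" ;;
        esac
    done
)

# Only the good records travel down the pipeline to wc.
passed=$(printf 'good 1\nbad 2\ngood 3\n' | filter1 2>/dev/null | wc -l)
# Here the roles are swapped: the complaints are captured, the data discarded.
complaints=$(printf 'good 1\nbad 2\ngood 3\n' | filter1 2>&1 >/dev/null)

echo "records passed on: $passed"
echo "reported separately: $complaints"
```

Because the complaint never enters the pipe, the next filter sees only
the two good records.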
|
|
|
|
For filter programs to work together, the format of the data has to be
|
|
agreed upon. The most straightforward and easiest format to use is simply
|
|
lines of text. Unix data files are generally just streams of bytes, with
|
|
lines delimited by the @sc{ASCII} @sc{LF} (Line Feed) character,
|
|
conventionally called a ``newline'' in the Unix literature. (This is
|
|
@code{'\n'} if you're a C programmer.) This is the format used by all
|
|
the traditional filtering programs. (Many earlier operating systems
|
|
had elaborate facilities and special purpose programs for managing
|
|
binary data. Unix has always shied away from such things, under the
|
|
philosophy that it's easiest to simply be able to view and edit your
|
|
data with a text editor.)
|
|
|
|
OK, enough introduction. Let's take a look at some of the tools, and then
|
|
we'll see how to hook them together in interesting ways. In the following
|
|
discussion, we will only present those command line options that interest
|
|
us. As you should always do, double check your system documentation
|
|
for the full story.
|
|
|
|
@node The who command
|
|
@unnumberedsec The @code{who} command
|
|
|
|
The first program is the @code{who} command. By itself, it generates a
|
|
list of the users who are currently logged in. Although I'm writing
|
|
this on a single-user system, we'll pretend that several people are
|
|
logged in:
|
|
|
|
@example
|
|
$ who
|
|
arnold   console Jan 22 19:57
miriam   ttyp0   Jan 23 14:19(:0.0)
bill     ttyp1   Jan 21 09:32(:0.0)
arnold   ttyp2   Jan 23 20:48(:0.0)
|
|
@end example
|
|
|
|
Here, the @samp{$} is the usual shell prompt, at which I typed @code{who}.
|
|
There are three people logged in, and I am logged in twice. On traditional
|
|
Unix systems, user names are never more than eight characters long. This
|
|
little bit of trivia will be useful later. The output of @code{who} is nice,
|
|
but the data is not all that exciting.
|
|
|
|
@node The cut command
|
|
@unnumberedsec The @code{cut} command
|
|
|
|
The next program we'll look at is the @code{cut} command. This program
|
|
cuts out columns or fields of input data. For example, we can tell it
|
|
to print just the login name and full name from the @file{/etc/passwd}
file. The @file{/etc/passwd} file has seven fields, separated by
|
|
colons:
|
|
|
|
@example
|
|
arnold:xyzzy:2076:10:Arnold D. Robbins:/home/arnold:/bin/ksh
|
|
@end example
|
|
|
|
To get the first and fifth fields, we would use @code{cut} like this:
|
|
|
|
@example
|
|
$ cut -d: -f1,5 /etc/passwd
|
|
root:Operator
|
|
@dots{}
|
|
arnold:Arnold D. Robbins
|
|
miriam:Miriam A. Robbins
|
|
@dots{}
|
|
@end example
|
|
|
|
With the @samp{-c} option, @code{cut} will cut out specific characters
|
|
(i.e., columns) in the input lines. This command looks like it might be
|
|
useful for data filtering.
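
To compare the two selection styles side by side, here is a small sketch
that runs @code{cut} on the sample @file{/etc/passwd} line shown earlier.
Keeping the line in a shell variable is a choice made for this example,
so that it does not depend on the accounts of the machine it runs on.

```shell
# The sample /etc/passwd line from the text.
entry='arnold:xyzzy:2076:10:Arnold D. Robbins:/home/arnold:/bin/ksh'

by_field=$(printf '%s\n' "$entry" | cut -d: -f1,5)   # fields 1 and 5
by_column=$(printf '%s\n' "$entry" | cut -c1-6)      # characters 1 through 6

echo "$by_field"     # arnold:Arnold D. Robbins
echo "$by_column"    # arnold
```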
|
|
|
|
|
|
@node The sort command
|
|
@unnumberedsec The @code{sort} command
|
|
|
|
Next we'll look at the @code{sort} command. This is one of the most
|
|
powerful commands on a Unix-style system; one that you will often find
|
|
yourself using when setting up fancy data plumbing. The @code{sort}
|
|
command reads and sorts each file named on the command line. It then
|
|
merges the sorted data and writes it to standard output. It will read
|
|
standard input if no files are given on the command line (thus
|
|
making it into a filter). The sort is based either on the machine
collating sequence (@sc{ASCII}) or on user-supplied ordering criteria.
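
As a small sketch of the two kinds of ordering, the following sorts
some made-up @samp{name:number} records (hypothetical data, not taken
from a real system), first by the default collating sequence and then
numerically on the second field using the @samp{-t} and @samp{-k}
options.

```shell
# Hypothetical "name:number" records, deliberately out of order.
records='miriam:112
bill:2076
arnold:10'

# Default: whole lines compared in collating order (LC_ALL=C pins ASCII).
by_name=$(printf '%s\n' "$records" | LC_ALL=C sort)
# User-supplied criteria: ':' separates fields; sort on field 2 numerically.
by_number=$(printf '%s\n' "$records" | LC_ALL=C sort -t: -k2,2n)

echo "$by_name"
echo "$by_number"
```

The first result lists arnold, bill, miriam; the second lists arnold
(10), miriam (112), bill (2076).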
|
|
|
|
|
|
@node The uniq command
|
|
@unnumberedsec The @code{uniq} command
|
|
|
|
Finally (at least for now), we'll look at the @code{uniq} program. When
|
|
sorting data, you will often end up with duplicate lines, lines that
|
|
are identical. Usually, all you need is one instance of each line.
|
|
This is where @code{uniq} comes in. The @code{uniq} program reads its
|
|
standard input, which it expects to be sorted. It prints out only one
copy of each duplicated line. It does have several options. Later on,
|
|
we'll use the @samp{-c} option, which prints each unique line, preceded
|
|
by a count of the number of times that line occurred in the input.
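
Why @code{uniq} expects sorted input can be seen from a short sketch.
The list of names is invented for this example; the point is that
@code{uniq} only compares adjacent lines.

```shell
names='arnold
miriam
arnold
bill
arnold'

# Unsorted: no two identical lines are adjacent, so nothing is removed.
unsorted=$(printf '%s\n' "$names" | uniq)
# Sorted first: the three copies of "arnold" collapse into one.
sorted=$(printf '%s\n' "$names" | sort | uniq)
# With -c, each surviving line is preceded by its count.
counted=$(printf '%s\n' "$names" | sort | uniq -c)

echo "$unsorted"
echo "$sorted"
echo "$counted"
```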
|
|
|
|
|
|
@node Putting the tools together
|
|
@unnumberedsec Putting the tools together
|
|
|
|
Now, let's suppose this is a large BBS system with dozens of users
|
|
logged in. The management wants the SysOp to write a program that will
|
|
generate a sorted list of logged in users. Furthermore, even if a user
|
|
is logged in multiple times, his or her name should only show up in the
|
|
output once.
|
|
|
|
The SysOp could sit down with the system documentation and write a C
|
|
program that did this. It would take perhaps a couple of hundred lines
|
|
of code and about two hours to write it, test it, and debug it.
|
|
However, knowing the software toolbox, the SysOp can instead start out
|
|
by generating just a list of logged on users:
|
|
|
|
@example
|
|
$ who | cut -c1-8
|
|
arnold
|
|
miriam
|
|
bill
|
|
arnold
|
|
@end example
|
|
|
|
Next, sort the list:
|
|
|
|
@example
|
|
$ who | cut -c1-8 | sort
|
|
arnold
|
|
arnold
|
|
bill
|
|
miriam
|
|
@end example
|
|
|
|
Finally, run the sorted list through @code{uniq}, to weed out duplicates:
|
|
|
|
@example
|
|
$ who | cut -c1-8 | sort | uniq
|
|
arnold
|
|
bill
|
|
miriam
|
|
@end example
|
|
|
|
The @code{sort} command actually has a @samp{-u} option that does what
|
|
@code{uniq} does. However, @code{uniq} has other uses for which one
|
|
cannot substitute @samp{sort -u}.
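
Here is a sketch of one such use. It feeds sample @code{who} output
through the same pipeline; the data is kept in a variable and laid out
in fixed-width columns, which is an assumption of this example rather
than something your own @code{who} is guaranteed to produce.

```shell
# Sample who output, so the sketch does not depend on who is logged in.
sample='arnold   console Jan 22 19:57
miriam   ttyp0   Jan 23 14:19(:0.0)
bill     ttyp1   Jan 21 09:32(:0.0)
arnold   ttyp2   Jan 23 20:48(:0.0)'

names=$(printf '%s\n' "$sample" | cut -c1-8)

with_uniq=$(printf '%s\n' "$names" | sort | uniq)
with_u=$(printf '%s\n' "$names" | sort -u)         # same result as sort | uniq
twice=$(printf '%s\n' "$names" | sort | uniq -d)   # only names logged in more than once

echo "$with_uniq"
echo "logged in more than once: $twice"
```

The @samp{-d} option of @code{uniq} reports only the repeated lines,
which @samp{sort -u} has no way to express.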
|
|
|
|
The SysOp puts this pipeline into a shell script, and makes it available for
|
|
all the users on the system:
|
|
|
|
@example
|
|
# cat > /usr/local/bin/listusers
|
|
who | cut -c1-8 | sort | uniq
|
|
^D
|
|
# chmod +x /usr/local/bin/listusers
|
|
@end example
|
|
|
|
There are four major points to note here. First, with just four
|
|
programs, on one command line, the SysOp was able to save about two
|
|
hours' worth of work. Furthermore, the shell pipeline is just about as
|
|
efficient as the C program would be, and it is much more efficient in
|
|
terms of programmer time. People time is much more expensive than
|
|
computer time, and in our modern ``there's never enough time to do
|
|
everything'' society, saving two hours of programmer time is no mean
|
|
feat.
|
|
|
|
Second, it is also important to emphasize that with the
|
|
@emph{combination} of the tools, it is possible to do a special
|
|
purpose job never imagined by the authors of the individual programs.
|
|
|
|
Third, it is also valuable to build up your pipeline in stages, as we did here.
|
|
This allows you to view the data at each stage in the pipeline, which helps
|
|
you acquire the confidence that you are indeed using these tools correctly.
|
|
|
|
Finally, by bundling the pipeline in a shell script, other users can use
|
|
your command, without having to remember the fancy plumbing you set up for
|
|
them. In terms of how you run them, shell scripts and compiled programs are
|
|
indistinguishable.
|
|
|
|
After the previous warm-up exercise, we'll look at two additional, more
|
|
complicated pipelines. For them, we need to introduce two more tools.
|
|
|
|
The first is the @code{tr} command, which stands for ``transliterate.''
|
|
The @code{tr} command works on a character-by-character basis, changing
|
|
characters. Normally it is used for things like mapping upper case to
|
|
lower case:
|
|
|
|
@example
|
|
$ echo ThIs ExAmPlE HaS MIXED case! | tr '[A-Z]' '[a-z]'
|
|
this example has mixed case!
|
|
@end example
|
|
|
|
There are several options of interest:
|
|
|
|
@table @samp
|
|
@item -c
|
|
work on the complement of the listed characters, i.e.,
|
|
operations apply to characters not in the given set
|
|
|
|
@item -d
|
|
delete characters in the first set from the output
|
|
|
|
@item -s
|
|
squeeze repeated characters in the output into just one character.
|
|
@end table
|
|
|
|
We will be using all three options in a moment.
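
Before combining them, it may help to see each option on its own. The
input strings in this sketch are invented for illustration.

```shell
# -s: runs of the listed character are squeezed to a single one.
squeezed=$(printf 'too    many   blanks\n' | tr -s ' ')
# -d: the listed characters are deleted.
deleted=$(printf 'hello, world!\n' | tr -d ',!')
# -c with -d: everything EXCEPT digits and newline (octal 012) is deleted.
digits=$(printf 'room 101, floor 3\n' | tr -cd '0-9\012')

echo "$squeezed"    # too many blanks
echo "$deleted"     # hello world
echo "$digits"      # 1013
```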
|
|
|
|
The other command we'll look at is @code{comm}. The @code{comm}
|
|
command takes two sorted input files as input data, and prints out the
|
|
files' lines in three columns. The output columns are the data lines
|
|
unique to the first file, the data lines unique to the second file, and
|
|
the data lines that are common to both. The @samp{-1}, @samp{-2}, and
|
|
@samp{-3} command line options omit the respective columns. (This is
|
|
non-intuitive and takes a little getting used to.) For example:
|
|
|
|
@example
|
|
$ cat f1
|
|
11111
|
|
22222
|
|
33333
|
|
44444
|
|
$ cat f2
|
|
00000
|
|
22222
|
|
33333
|
|
55555
|
|
$ comm f1 f2
|
|
        00000
11111
                22222
                33333
44444
        55555
|
|
@end example
|
|
|
|
A single dash (@samp{-}) given as a filename tells @code{comm} to read
standard input instead of a regular file.
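
The column-suppressing options and the dash can be tried out with the
following sketch. It recreates @file{f1} and @file{f2} in a scratch
directory made with @code{mktemp}; the directory and variable names are
choices made for this example.

```shell
# Recreate f1 and f2 from the text in a scratch directory.
dir=$(mktemp -d)
printf '11111\n22222\n33333\n44444\n' > "$dir/f1"
printf '00000\n22222\n33333\n55555\n' > "$dir/f2"

only_f1=$(comm -23 "$dir/f1" "$dir/f2")         # omit columns 2 and 3
common=$(comm -12 "$dir/f1" "$dir/f2")          # omit columns 1 and 2
via_stdin=$(comm -12 - "$dir/f2" < "$dir/f1")   # "-" reads f1 from standard input
rm -r "$dir"

echo "$only_f1"
echo "$common"
```

Omitting columns 2 and 3 leaves the lines unique to the first file
(11111 and 44444); omitting columns 1 and 2 leaves the common lines
(22222 and 33333).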
|
|
|
|
Now we're ready to build a fancy pipeline. The first application is a word
|
|
frequency counter. This helps an author determine if he or she is over-using
|
|
certain words.
|
|
|
|
The first step is to change the case of all the letters in our input file
|
|
to one case. ``The'' and ``the'' are the same word when doing counting.
|
|
|
|
@example
|
|
$ tr '[A-Z]' '[a-z]' < whats.gnu | ...
|
|
@end example
|
|
|
|
The next step is to get rid of punctuation. Quoted words and unquoted words
|
|
should be treated identically; it's easiest to just get the punctuation out of
|
|
the way.
|
|
|
|
@example
|
|
$ tr '[A-Z]' '[a-z]' < whats.gnu | tr -cd '[A-Za-z0-9_ \012]' | ...
|
|
@end example
|
|
|
|
The second @code{tr} command operates on the complement of the listed
|
|
characters, which are all the letters, the digits, the underscore, and
|
|
the blank. The @samp{\012} represents the newline character; it has to
|
|
be left alone. (The ASCII TAB character should also be included for
|
|
good measure in a production script.)
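
To see the first two stages at work without a copy of
@file{whats.gnu}, here is a sketch that runs them on a made-up
sentence. The second command is a variant, assumed for illustration,
that also keeps the TAB character (octal 011).

```shell
# Lower-case everything, then delete all but letters, digits, underscore,
# blank, and newline.
cleaned=$(printf '"The" tools, the TOOLBOX!\n' |
    tr '[A-Z]' '[a-z]' | tr -cd '[A-Za-z0-9_ \012]')
echo "$cleaned"    # the tools the toolbox

# Variant that leaves TAB alone as well.
with_tab=$(printf 'one,\ttwo;\n' | tr -cd '[A-Za-z0-9_ \011\012]')
printf '%s\n' "$with_tab" | tr '\t' '>'
```

Note that the square brackets inside the quotes are themselves members
of the set, so any brackets in the input would also be kept.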
|
|
|
|
At this point, we have data consisting of words separated by blank space.
|
|
The words only contain alphanumeric characters (and the underscore). The
|
|
next step is to break the data apart so that we have one word per line. This
|
|
makes the counting operation much easier, as we will see shortly.
|
|
|
|
@example
|
|
$ tr '[A-Z]' '[a-z]' < whats.gnu | tr -cd '[A-Za-z0-9_ \012]' |
|
|
> tr -s '[ ]' '\012' | ...
|
|
@end example
|
|
|
|
This command turns blanks into newlines. The @samp{-s} option squeezes
|
|
multiple newline characters in the output into just one. This helps us
|
|
avoid blank lines. (The @samp{>} is the shell's ``secondary prompt.''
|
|
This is what the shell prints when it notices you haven't finished
|
|
typing in all of a command.)
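
The effect of @samp{-s} is easiest to see by counting lines with and
without it. The sample text, with its uneven runs of blanks, is invented
for this sketch.

```shell
text='the  quick   brown fox'

# Without -s, every blank becomes its own newline, leaving empty lines.
plain=$(printf '%s\n' "$text" | tr '[ ]' '\012' | wc -l)
# With -s, runs of newlines in the output collapse: one word per line.
squeezed=$(printf '%s\n' "$text" | tr -s '[ ]' '\012' | wc -l)
words=$(printf '%s\n' "$text" | tr -s '[ ]' '\012')

echo "lines without -s: $plain"
echo "lines with -s: $squeezed"
echo "$words"
```

Seven lines come out without @samp{-s} (three of them empty), and four
with it.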
|
|
|
|
We now have data consisting of one word per line, no punctuation, all one
|
|
case. We're ready to count each word:
|
|
|
|
@example
|
|
$ tr '[A-Z]' '[a-z]' < whats.gnu | tr -cd '[A-Za-z0-9_ \012]' |
|
|
> tr -s '[ ]' '\012' | sort | uniq -c | ...
|
|
@end example
|
|
|
|
At this point, the data might look something like this:
|
|
|
|
@example
|
|
60 a
|
|
2 able
|
|
6 about
|
|
1 above
|
|
2 accomplish
|
|
1 acquire
|
|
1 actually
|
|
2 additional
|
|
@end example
|
|
|
|
The output is sorted by word, not by count! What we want is the most
|
|
frequently used words first. Fortunately, this is easy to accomplish,
|
|
with the help of two more @code{sort} options:
|
|
|
|
@table @samp
|
|
@item -n
|
|
do a numeric sort, not an ASCII one
|
|
|
|
@item -r
|
|
reverse the order of the sort
|
|
@end table
|
|
|
|
The final pipeline looks like this:
|
|
|
|
@example
|
|
$ tr '[A-Z]' '[a-z]' < whats.gnu | tr -cd '[A-Za-z0-9_ \012]' |
|
|
> tr -s '[ ]' '\012' | sort | uniq -c | sort -nr
|
|
156 the
|
|
60 a
|
|
58 to
|
|
51 of
|
|
51 and
|
|
...
|
|
@end example
|
|
|
|
Whew! That's a lot to digest. Yet, the same principles apply. With six
|
|
commands, on two lines (really one long one split for convenience), we've
|
|
created a program that does something interesting and useful, in much
|
|
less time than we could have written a C program to do the same thing.
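
If you don't have a copy of @file{whats.gnu} handy, the whole pipeline
can be tried on a couple of lines of text. The sample text below is an
assumption made for this sketch; everything after the first
@code{printf} is the pipeline from the column.

```shell
sample='The cat and the hat.
And the bat!'

counts=$(printf '%s\n' "$sample" |
    tr '[A-Z]' '[a-z]' | tr -cd '[A-Za-z0-9_ \012]' |
    tr -s '[ ]' '\012' | sort | uniq -c | sort -nr)

echo "$counts"
# The most frequent word is on the first line.
top=$(printf '%s\n' "$counts" | head -n 1)
```

Here ``the'' occurs three times and ``and'' twice, so they head the
list; the order of the words that occur once each is up to
@code{sort}.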
|
|
|
|
A minor modification to the above pipeline can give us a simple spelling
|
|
checker! To determine if you've spelled a word correctly, all you have to
|
|
do is look it up in a dictionary. If it is not there, then chances are
|
|
that your spelling is incorrect. So, we need a dictionary. If you
|
|
have the Slackware Linux distribution, you have the file
|
|
@file{/usr/lib/ispell/ispell.words}, which is a sorted, 38,400 word
|
|
dictionary.
|
|
|
|
Now, how to compare our file with the dictionary? As before, we generate
|
|
a sorted list of words, one per line:
|
|
|
|
@example
|
|
$ tr '[A-Z]' '[a-z]' < whats.gnu | tr -cd '[A-Za-z0-9_ \012]' |
|
|
> tr -s '[ ]' '\012' | sort -u | ...
|
|
@end example
|
|
|
|
Now, all we need is a list of words that are @emph{not} in the
|
|
dictionary. Here is where the @code{comm} command comes in.
|
|
|
|
@example
|
|
$ tr '[A-Z]' '[a-z]' < whats.gnu | tr -cd '[A-Za-z0-9_ \012]' |
|
|
> tr -s '[ ]' '\012' | sort -u |
|
|
> comm -23 - /usr/lib/ispell/ispell.words
|
|
@end example
|
|
|
|
The @samp{-2} and @samp{-3} options eliminate lines that are only in the
|
|
dictionary (the second file), and lines that are in both files. Lines
|
|
only in the first file (standard input, our stream of words) are
|
|
words that are not in the dictionary. These are likely candidates for
|
|
spelling errors. This pipeline was the first cut at a production
|
|
spelling checker on Unix.
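
The same pipeline can be tried against a toy dictionary. In this sketch
the four-word dictionary file and the sample sentence are hypothetical,
and @samp{LC_ALL=C} is added so that @code{sort} and @code{comm} agree
on the ordering.

```shell
dir=$(mktemp -d)
# A tiny stand-in dictionary; it must be sorted, one word per line.
printf 'and\ncat\nhat\nthe\n' > "$dir/words"

suspects=$(printf 'The cat and teh hat.\n' |
    tr '[A-Z]' '[a-z]' | tr -cd '[A-Za-z0-9_ \012]' |
    tr -s '[ ]' '\012' | LC_ALL=C sort -u |
    LC_ALL=C comm -23 - "$dir/words")
rm -r "$dir"

echo "$suspects"    # teh
```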
|
|
|
|
There are some other tools that deserve brief mention.
|
|
|
|
@table @code
|
|
@item grep
|
|
search files for text that matches a regular expression
|
|
|
|
@item egrep
|
|
like @code{grep}, but with more powerful regular expressions
|
|
|
|
@item wc
|
|
count lines, words, characters
|
|
|
|
@item tee
|
|
a T-fitting for data pipes, copies data to files and to standard output
|
|
|
|
@item sed
|
|
the stream editor, an advanced tool
|
|
|
|
@item awk
|
|
a data manipulation language, another advanced tool
|
|
@end table
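
A brief sketch of three of these tools working together follows. The
list of names and the scratch directory are assumptions made for the
example.

```shell
dir=$(mktemp -d)

# tee saves a copy of the stream in a file while passing it along;
# grep -c counts the lines that match the regular expression.
matches=$(printf 'arnold\nmiriam\nbill\narnold\n' |
    tee "$dir/raw" | grep -c '^arnold$')
# wc -l counts the lines that tee saved.
total=$(wc -l < "$dir/raw")
rm -r "$dir"

echo "matching lines: $matches"
echo "lines saved by tee: $total"
```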
|
|
|
|
The software tools philosophy also espoused the following bit of
|
|
advice: ``Let someone else do the hard part.'' This means, take
|
|
something that gives you most of what you need, and then massage it the
|
|
rest of the way until it's in the form that you want.
|
|
|
|
To summarize:
|
|
|
|
@enumerate 1
|
|
@item
|
|
Each program should do one thing well. No more, no less.
|
|
|
|
@item
|
|
Combining programs with appropriate plumbing leads to results where
|
|
the whole is greater than the sum of the parts. It also leads to novel
|
|
uses of programs that the authors might never have imagined.
|
|
|
|
@item
|
|
Programs should never print extraneous header or trailer data, since these
|
|
could get sent on down a pipeline. (A point we didn't mention earlier.)
|
|
|
|
@item
|
|
Let someone else do the hard part.
|
|
|
|
@item
|
|
Know your toolbox! Use each program appropriately. If you don't have an
|
|
appropriate tool, build one.
|
|
@end enumerate
|
|
|
|
As of this writing, all the programs we've discussed are available via
|
|
anonymous @code{ftp} from @code{prep.ai.mit.edu} as
|
|
@file{/pub/gnu/textutils-1.9.tar.gz}.@footnote{Version 1.9 was current
|
|
when this column was written. Check the nearest GNU archive for the
|
|
current version. The main GNU FTP site is now @code{ftp.gnu.org}.}
|
|
|
|
None of what I have presented in this column is new. The Software Tools
|
|
philosophy was first introduced in the book @cite{Software Tools},
|
|
by Brian Kernighan and P.J. Plauger (Addison-Wesley, ISBN
|
|
0-201-03669-X). This book showed how to write and use software
|
|
tools. It was written in 1976, using a preprocessor for FORTRAN named
|
|
@code{ratfor} (RATional FORtran). At the time, C was not as ubiquitous
|
|
as it is now; FORTRAN was. The last chapter presented a
@code{ratfor}-to-FORTRAN processor, written in @code{ratfor}. @code{ratfor} looks an
|
|
awful lot like C; if you know C, you won't have any problem following
|
|
the code.
|
|
|
|
In 1981, the book was updated and made available as @cite{Software
|
|
Tools in Pascal} (Addison-Wesley, ISBN 0-201-10342-7). Both books
|
|
remain in print, and are well worth reading if you're a programmer.
|
|
They certainly made a major change in how I view programming.
|
|
|
|
Initially, the programs in both books were available (on 9-track tape)
|
|
from Addison-Wesley. Unfortunately, this is no longer the case,
|
|
although you might be able to find copies floating around the Internet.
|
|
For a number of years, there was an active Software Tools Users Group,
|
|
whose members had ported the original @code{ratfor} programs to essentially
|
|
every computer system with a FORTRAN compiler. The popularity of the
|
|
group waned in the mid-1980s as Unix began to spread beyond universities.
|
|
|
|
With the current proliferation of GNU code and other clones of Unix programs,
|
|
these programs now receive little attention; modern C versions are
|
|
much more efficient and do more than these programs do. Nevertheless, as
|
|
an exposition of good programming style, and evangelism for a still-valuable
|
|
philosophy, these books are unparalleled, and I recommend them highly.
|
|
|
|
Acknowledgment: I would like to express my gratitude to Brian Kernighan
|
|
of Bell Labs, the original Software Toolsmith, for reviewing this column.
|
|
|
|
|
|
@node Index
|
|
@unnumbered Index
|
|
|
|
@printindex cp
|
|
|
|
@contents
|
|
@bye
|
|
|
|
@c Local variables:
|
|
@c texinfo-column-for-description: 32
|
|
@c End:
|