Commit Graph

132 Commits

Author SHA1 Message Date
Marc-André Lemburg
3c72ded23d Removed the decoding_map from the codecs where this is possible.
Replaced the tis_620, cp1140 and koi8_u codecs with new ones
based on custom mapping files.
2005-10-24 12:07:49 +00:00
Marc-André Lemburg
0f00ba8bd8 Replace the old EBCDIC codecs with new ones using the decoding table. 2005-10-21 14:35:35 +00:00
Marc-André Lemburg
7797be7b3b Alias iso8859_1 to latin_1 which is the same encoding, but has
a much faster codec implementation.
2005-10-21 14:02:28 +00:00
Marc-André Lemburg
75c9e8392e Add a few more Mac OS encodings. The mapping tables for these are
available at ftp.unicode.org.
2005-10-21 13:58:32 +00:00
Marc-André Lemburg
a1129f4b9b Replace the old charmap codecs with new ones generated from the current
mapping tables available at ftp.unicode.org.

These new codecs include and use character decoding tables which speeds
up decoding by a few factors.
2005-10-21 13:49:12 +00:00
Walter Dörwald
007f8dfde2 Bug #1245379: Add "unicode-1-1-utf-7" as an alias for "utf-7" as specified
by RFC 1642.
2005-10-09 19:42:27 +00:00
Neal Norwitz
4ce69a5b06 No need to import exceptions, they are builtins 2005-09-01 00:45:28 +00:00
Martin v. Löwis
8b59514e57 Make IDNA return an empty string when the input is empty. Fixes #1163178.
Will backport to 2.4.
2005-08-25 11:03:38 +00:00
Walter Dörwald
729c31f5c3 Reset internal buffers when seek() is called. This fixes SF bug #1156259. 2005-03-14 19:06:30 +00:00
Walter Dörwald
e1a0391b49 Fix wrong variable name. 2004-12-29 13:11:10 +00:00
Marc-André Lemburg
9ab8818c87 Rearranged mappings to value sorting order. 2004-12-10 21:54:35 +00:00
Walter Dörwald
69652035bc SF patch #998993: The UTF-8 and the UTF-16 stateful decoders now support
decoding incomplete input (when the input stream is temporarily exhausted).
codecs.StreamReader now implements buffering, which enables proper
readline support for the UTF-16 decoders. codecs.StreamReader.read()
has a new argument chars which specifies the number of characters to
return. codecs.StreamReader.readline() and codecs.StreamReader.readlines()
have a new argument keepends. Trailing "\n"s will be stripped from the lines
if keepends is false. Added C APIs PyUnicode_DecodeUTF8Stateful and
PyUnicode_DecodeUTF16Stateful.
2004-09-07 20:24:22 +00:00
Tim Peters
d1b7827216 Whitespace normalization. 2004-08-07 06:03:09 +00:00
Marc-André Lemburg
c759f070ef Added new codecs and aliases for ISO_8859-11, ISO_8859-16 and
TIS-620.

Closes SF bug #1001895: Adding missing ISO 8859 codecs, especially Thai.
2004-08-05 12:43:30 +00:00
Tim Peters
c0cbc8611b Whitespace normalization. 2004-07-31 21:17:37 +00:00
Marc-André Lemburg
17b6d28c64 New codec: [ 996067 ] hp-roman8 codec 2004-07-28 15:37:54 +00:00
Marc-André Lemburg
cd8a4cb3d3 Added new codec hp-roman8 submitted as patch [ 996067 ] hp-roman8 codec. 2004-07-28 15:35:29 +00:00
Hye-Shik Chang
2bb146f2f4 Bring CJKCodecs 1.1 into trunk. This completely reorganizes source
and installed layouts to make maintenance simple and easy.  And it
also adds four new codecs; big5hkscs, euc-jis-2004, shift-jis-2004
and iso2022-jp-2004.
2004-07-18 03:06:29 +00:00
Tim Peters
4e0e1b6a54 Whitespace normalization. 2004-07-07 20:54:48 +00:00
Martin v. Löwis
708b4dacf4 Convert input to a string object. Fixes #909230.
Backported 2.3.
2004-03-23 23:40:36 +00:00
Hye-Shik Chang
5c5316f111 Add a new unicode codec: ptcp154 (Kazakh) 2004-03-19 08:06:07 +00:00
Marc-André Lemburg
361d66de5d Fix wrong character mapping in koi8_u: SF bug #902501. 2004-02-23 09:00:43 +00:00
Marc-André Lemburg
c83dddf7fe Let the default encodings search function lookup aliases before trying the codec import. This allows applications to install codecs which override (non-special-cased) builtin codecs. 2004-01-20 09:40:14 +00:00
Marc-André Lemburg
5c94d33077 Add some more code page aliases needed for completeness. 2004-01-20 09:38:52 +00:00
Hye-Shik Chang
b619e4b36c Fix a typo: s/iso_3022/iso2022/ 2004-01-20 09:33:30 +00:00
Hye-Shik Chang
3e2a306920 Add CJK codecs support as discussed on python-dev. (SF #873597)
Several style fixes are suggested by Martin v. Loewis and
Marc-Andre Lemburg. Thanks!
2004-01-17 14:29:29 +00:00
Raymond Hettinger
0ad142aba0 Revert previous change. MAL preferred the old version. 2003-12-01 13:26:46 +00:00
Raymond Hettinger
a45517065a Simplifed the code. 2003-12-01 10:41:02 +00:00
Raymond Hettinger
9edae346dd Fix typo in the comments. 2003-09-24 03:57:36 +00:00
Raymond Hettinger
9a80c5dbc4 Added codec for bz2 compression. 2003-09-23 20:21:01 +00:00
Martin v. Löwis
0d8e16c7ad Support trailing dots in DNS names. Fixes #782510. Will backport to 2.3. 2003-08-05 06:19:47 +00:00
Skip Montanaro
5d6ceb4aae more generic reference to python interpreter 2003-07-22 14:37:42 +00:00
Marc-André Lemburg
2820125935 Remove usage of re module from encodings package search function. 2003-05-16 17:07:51 +00:00
Tim Peters
0eadaac7dc Whitespace normalization. 2003-04-24 16:02:54 +00:00
Martin v. Löwis
2548c730c1 Implement IDNA (Internationalized Domain Names in Applications). 2003-04-18 10:39:54 +00:00
Martin v. Löwis
7fb697b5d2 Revert Patch #670715: iconv support. 2003-04-03 04:49:12 +00:00
Neal Norwitz
6156a2d07c Handle iconv initialization erorrs 2003-02-28 20:00:42 +00:00
Martin v. Löwis
9789aefa61 Patch #670715: Universal Unicode Codec for POSIX iconv. 2003-01-26 11:30:36 +00:00
Tim Peters
6578dc925f Whitespace normalization. 2002-12-24 18:31:27 +00:00
Neal Norwitz
d8407a7031 Add new encoding for Ukrainian Cyrillic 2002-10-17 22:15:33 +00:00
Guido van Rossum
c8c6065231 When looking for an alias, first look for the normalized name (which
still may contain dots), then if that doesn't exist look for the name
with dots replaced by underscores.  This is a little more forgiving.
2002-10-04 20:49:05 +00:00
Marc-André Lemburg
8dc5ff2e5a Undo the removal. Guido mentioned that the encoding name is in active
by some email headers.
2002-10-04 16:30:42 +00:00
Marc-André Lemburg
68fc27385d Remove unneeded alias. 2002-10-04 15:57:03 +00:00
Marc-André Lemburg
a40ea75625 Fix doc-string. 2002-10-04 11:58:24 +00:00
Marc-André Lemburg
9d158bb66f Adapt lookup names to new more general encoding name normalization
scheme.
2002-10-04 11:51:39 +00:00
Marc-André Lemburg
7012673d67 Extending the encoding name normalization to handle more non-alphanumeric
characters.
2002-10-04 11:45:38 +00:00
Guido van Rossum
479f3d3d2a Oops, must convert hyphens to underscores in keys of aliases dict. 2002-09-26 20:08:23 +00:00
Guido van Rossum
b7a88e533d Add yet another alias for ASCII found in the field. Will backport to
2.2.2.
2002-09-25 16:44:34 +00:00
Tim Peters
280488b9a3 Whitespace normalization. 2002-08-23 18:19:30 +00:00
Martin v. Löwis
8a8da798a5 Patch #505705: Remove eval in pickle and cPickle. 2002-08-14 07:46:28 +00:00
Tim Peters
469cdad822 Whitespace normalization. 2002-08-08 20:19:19 +00:00
Martin v. Löwis
b9e0764d8b Revert #571603 since it is ok to import codecs that are not subdirectories
of encodings. Skip modules that don't have a getregentry function.
2002-07-29 14:05:24 +00:00
Martin v. Löwis
fc4c24c142 Patch #571603: Refer to encodings package explicitly. 2002-07-28 11:31:33 +00:00
Marc-André Lemburg
a83ffa89f2 Palm OS encoding from Sjoerd Mullender 2002-07-12 14:36:22 +00:00
Marc-André Lemburg
3ccb09cba3 Fix for bug #222395: UTF-16 et al. don't handle .readline().
They now raise an NotImplementedError to hint to the truth ;-)
2002-04-05 12:12:00 +00:00
Marc-André Lemburg
a0af63b242 Corrected import behaviour for codecs which live outside the encodings
package.
2002-02-11 17:43:46 +00:00
Marc-André Lemburg
462004e90a Add IANA character set aliases to the encodings alias dictionary
and make alias lookup lazy.

Note that only those IANA character set aliases were added for which
we actually have codecs in the encodings package.
2002-02-10 21:36:20 +00:00
Martin v. Löwis
79d802d58c Patch #487275: Add windows-1251 charset alias. 2001-12-02 12:24:19 +00:00
Marc-André Lemburg
35b0cb09d7 Python part of the UTF-7 codec by Brian Quinlan. 2001-09-20 12:56:14 +00:00
Marc-André Lemburg
c60e6f7771 Patch #435971: UTF-7 codec by Brian Quinlan. 2001-09-20 10:35:46 +00:00
Marc-André Lemburg
26e3b681b2 Patch #462635 by Andrew Kuchling correcting bugs in the new
codecs -- the self argument does matter for Python functions (it
does not for C functions which most other codecs use).
2001-09-20 10:33:38 +00:00
Marc-André Lemburg
816a1b75b7 Fixed search function error reporting in the encodings package
__init__.py module to raise errors which can be catched as LookupErrors
as well as SystemErrors.

Modified the error messages to include more information about the
failing module.
2001-09-19 11:52:07 +00:00
Andrew M. Kuchling
fd6608bcea Fix typo (PyChecker) 2001-08-13 13:48:55 +00:00
Martin v. Löwis
9b75dca192 Expose nl_langinfo through locale where available. 2001-08-10 13:58:50 +00:00
Marc-André Lemburg
92b550cdd8 This patch by Martin v. Loewis changes the UTF-16 codec to only
write a BOM at the start of the stream and also to only read it as
BOM at the start of a stream.

Subsequent reading/writing of BOMs will read/write the BOM as ZWNBSP
character. This is in sync with the Unicode specifications.

Note that UTF-16 files will now *have* to start with a BOM mark
in order to be readable by the codec.
2001-06-19 20:07:51 +00:00
Martin v. Löwis
13b8bc5478 Patch #429957: Add support for cp1140, which is identical to cp037,
with the addition of the euro character.
Also added a few EDBDIC aliases.
2001-06-07 19:39:25 +00:00
Mark Hammond
194bfb2805 Add some useful Windows encodings - patch #423221. 2001-06-04 02:31:23 +00:00
Marc-André Lemburg
716cf91839 Moved the encoding map building logic from the individual mapping
codec files to codecs.py and added logic so that multi mappings
in the decoding maps now result in mappings to None (undefined mapping)
in the encoding maps.
2001-05-16 09:41:45 +00:00
Guido van Rossum
acfdf156aa Add quoted-printable codec 2001-05-15 15:34:07 +00:00
Marc-André Lemburg
2d9204199f This patch changes the way the string .encode() method works slightly
and introduces a new method .decode().

The major change is that strg.encode() will no longer try to convert
Unicode returns from the codec into a string, but instead pass along
the Unicode object as-is. The same is now true for all other codec
return types. The underlying C APIs were changed accordingly.

Note that even though this does have the potential of breaking
existing code, the chances are low since conversion from Unicode
previously took place using the default encoding which is normally
set to ASCII rendering this auto-conversion mechanism useless for
most Unicode encodings.

The good news is that you can now use .encode() and .decode() with
much greater ease and that the door was opened for better accessibility
of the builtin codecs.

As demonstration of the new feature, the patch includes a few new
codecs which allow string to string encoding and decoding (rot13,
hex, zip, uu, base64).

Written by Marc-Andre Lemburg. Copyright assigned to the PSF.
2001-05-15 12:00:02 +00:00
Marc-André Lemburg
a866df806d This patch changes the default behaviour of the builtin charmap
codec to not apply Latin-1 mappings for keys which are not found
in the mapping dictionaries, but instead treat them as undefined
mappings.

The patch was originally written by Martin v. Loewis with some
additional (cosmetic) changes and an updated test script
by Marc-Andre Lemburg.

The standard codecs were recreated from the most current files
available at the Unicode.org site using the Tools/scripts/gencodec.py
tool.

This patch closes the bugs #116285 and #119960.
2001-01-03 21:29:14 +00:00
Marc-André Lemburg
988ad2bdff Changed .getaliases() support to register the new aliases in the
encodings package aliases mapping dictionary rather than in the
internal cache used by the search function.

This enables aliases to take advantage of the full normalization
process applied to encoding names which was previously not available.

The patch restricts alias registration to new aliases. Existing
aliases cannot be overridden anymore.
2000-12-12 14:45:35 +00:00
Thomas Wouters
7e47402264 Spelling fixes supplied by Rob W. W. Hooft. All these are fixes in either
comments, docstrings or error messages. I fixed two minor things in
test_winreg.py ("didn't" -> "Didn't" and "Didnt" -> "Didn't").

There is a minor style issue involved: Guido seems to have preferred English
grammar (behaviour, honour) in a couple places. This patch changes that to
American, which is the more prominent style in the source. I prefer English
myself, so if English is preferred, I'd be happy to supply a patch myself ;)
2000-07-16 12:04:32 +00:00
Marc-André Lemburg
7ebb92ea66 Marc-Andre Lemburg <mal@lemburg.com>:
Removed import of string module -- use string methods directly.
Thanks to Finn Bock.
2000-06-13 12:04:05 +00:00
Marc-André Lemburg
4fd73f0465 Marc-Andre Lemburg <mal@lemburg.com>:
Added some more codec aliases. Some of them are needed by the
new locale.py encoding support.
2000-06-07 09:12:30 +00:00
Marc-André Lemburg
54480d300a New codec which always raises an exception when used. This
codec can be used to effectively switch off string coercion
to Unicode.
2000-06-07 09:04:05 +00:00
Guido van Rossum
9e896b37c7 Marc-Andre's third try at this bulk patch seems to work (except that
his copy of test_contains.py seems to be broken -- the lines he
deleted were already absent).  Checkin messages:


New Unicode support for int(), float(), complex() and long().

- new APIs PyInt_FromUnicode() and PyLong_FromUnicode()
- added support for Unicode to PyFloat_FromString()
- new encoding API PyUnicode_EncodeDecimal() which converts
  Unicode to a decimal char* string (used in the above new
  APIs)
- shortcuts for calls like int(<int object>) and float(<float obj>)
- tests for all of the above

Unicode compares and contains checks:
- comparing Unicode and non-string types now works; TypeErrors
  are masked, all other errors such as ValueError during
  Unicode coercion are passed through (note that PyUnicode_Compare
  does not implement the masking -- PyObject_Compare does this)
- contains now works for non-string types too; TypeErrors are
  masked and 0 returned; all other errors are passed through

Better testing support for the standard codecs.

Misc minor enhancements, such as an alias dbcs for the mbcs codec.

Changes:
- PyLong_FromString() now applies the same error checks as
  does PyInt_FromString(): trailing garbage is reported
  as error and not longer silently ignored. The only characters
  which may be trailing the digits are 'L' and 'l' -- these
  are still silently ignored.
- string.ato?() now directly interface to int(), long() and
  float(). The error strings are now a little different, but
  the type still remains the same. These functions are now
  ready to get declared obsolete ;-)
- PyNumber_Int() now also does a check for embedded NULL chars
  in the input string; PyNumber_Long() already did this (and
  still does)

Followed by:

Looks like I've gone a step too far there... (and test_contains.py
seem to have a bug too).

I've changed back to reporting all errors in PyUnicode_Contains()
and added a few more test cases to test_contains.py (plus corrected
the join() NameError).
2000-04-05 20:11:21 +00:00
Guido van Rossum
68895ed70c Marc-Andre Lemburg: use all lowercase names. 2000-03-31 17:23:18 +00:00
Guido van Rossum
24bdb0474f Marc-Andre Lemburg:
The attached patch set includes a workaround to get Python with
Unicode compile on BSDI 4.x (courtesy Thomas Wouters; the cause
is a bug in the BSDI wchar.h header file) and Python interfaces
for the MBCS codec donated by Mark Hammond.

Also included are some minor corrections w/r to the docs of
the new "es" and "es#" parser markers (use PyMem_Free() instead
of free(); thanks to Mark Hammond for finding these).

The unicodedata tests are now in a separate file
(test_unicodedata.py) to avoid problems if the module cannot
be found.
2000-03-28 20:29:59 +00:00
Guido van Rossum
1abd82c07d MBCS codecs for Windows. Contributed by Mark Hammond. 2000-03-28 01:58:50 +00:00
Barry Warsaw
51ac58039f On 17-Mar-2000, Marc-Andre Lemburg said:
Attached you find an update of the Unicode implementation.

    The patch is against the current CVS version. I would appreciate
    if someone with CVS checkin permissions could check the changes
    in.

    The patch contains all bugs and patches sent this week and also
    fixes a leak in the codecs code and a bug in the free list code
    for Unicode objects (which only shows up when compiling Python
    with Py_DEBUG; thanks to MarkH for spotting this one).
2000-03-20 16:36:48 +00:00
Guido van Rossum
0229bf6001 Marc-Andre Lemburg: Unicode encodings. 2000-03-10 23:17:24 +00:00