Now available at
http://EmpowermentZone.com/Encoding.zip

Encoding
Version 1.0
August 8, 2010
Copyright 2010 by Jamal Mazrui
GNU Lesser General Public License (LGPL)

----------
Contents

Description
Installation
Operation
Development Notes

----------
Description

Encoding is a free, open source, command-line utility for performing encoding-related operations on files. It can show the encoding of files, and convert between different encodings. Batch operations are supported if wildcard characters are used in the file specification. The executable, Encoding.exe, should run on any version of Windows. The source code, Encoding.py, should run on other platforms as well.
An encoding is an agreement about how to represent textual characters with computer bytes. Characters are encoded as byte sequences that may be stored in disk files or computer memory. A byte stream is decoded to produce characters in a human language. If a text file is not readable, the reason may be that it has an encoding that was either not recognized or not decoded properly. This utility may help with such issues, benefiting software developers or end users. It works with over a hundred character encodings.
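
The relationship between characters and bytes can be illustrated with a short sketch. It is written for Python 3 (the utility itself targets Python 2.5), and the sample string is an arbitrary choice for illustration, not something taken from the utility.

```python
# Illustration only: the sample text is an arbitrary choice.
text = "caf\u00e9"  # "cafe" with an acute accent on the final letter

# Encoding turns characters into bytes; the bytes differ per encoding.
utf8_bytes = text.encode("utf-8")      # b'caf\xc3\xa9' (two bytes for the accent)
cp1252_bytes = text.encode("cp1252")   # b'caf\xe9' (one byte for the accent)

# Decoding with the matching encoding recovers the original characters.
assert utf8_bytes.decode("utf-8") == text

# Decoding with the wrong encoding produces unreadable text instead.
garbled = utf8_bytes.decode("cp1252")
print(ascii(garbled))  # 'caf\xc3\xa9': two wrong characters replace the accented one
```

The last step shows the kind of problem the utility is meant to help diagnose: the bytes are intact, but they were decoded with an encoding other than the one used to write them.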

----------
Installation

Unarchive Encoding.zip into a directory, e.g., into
C:\Encoding

Run Encoding.exe at a command prompt, e.g., one created by entering cmd.exe at the Windows Start/Run prompt.

Since Encoding is developed in a cross-platform language, Python, it should also be possible to run the source code, Encoding.py, on other platforms that have a Python interpreter.

----------
Operation

The complete command-line syntax of Encoding is as follows:

Encoding.exe TaskName FileSpec SourceEncoding TargetEncoding

Some parameters are optional or not applicable depending on the name of the task. Typing the .exe extension is optional. Capitalization does not matter in task or encoding names. The following tasks are supported, illustrated with example parameter values:

encoding help
provides a help summary. The help parameter is assumed if no other valid task name is entered.

encoding default
provides the default language and encoding of the computer, e.g.,
en-us cp1252
which means U.S. English using code page 1252.

encoding show *.txt
provides the encoding of all files meeting the *.txt specification. If a file has a Unicode byte order mark (BOM), the encoding can be exactly determined. Otherwise, the encoding is heuristically detected by analyzing various factors. This is the same algorithm used by the Firefox web browser to detect the encoding of text. It is usually correct, but not always.
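
The BOM check described above can be sketched as follows. This is a minimal Python 3 illustration, not the utility's actual code: the function name encoding_from_bom and the returned labels are assumptions made for this example, and the heuristic fallback for files without a BOM is not shown.

```python
import codecs

# Check the longer BOMs first: the UTF-32 LE BOM starts with the same
# two bytes as the UTF-16 LE BOM.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def encoding_from_bom(data):
    """Return a label for the BOM that starts data, or None if there is none.

    Hypothetical helper for illustration; the labels are this sketch's choice.
    """
    for bom, label in BOMS:
        if data.startswith(bom):
            return label
    return None

print(encoding_from_bom(codecs.BOM_UTF8 + b"hello"))  # utf-8-sig
print(encoding_from_bom(b"hello"))                    # None
```

A result of None only means that no BOM was found. The file may still be in any encoding, which is where heuristic detection takes over.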

encoding convert *.txt utf-8b
converts all *.txt files to UTF-8 encoding with a BOM. Use utf-8n to get UTF-8 without a BOM, which is the norm on Linux and the Mac. For ease of typing, the dash character (-) is optional, so utf8b or utf8n may be used instead. Note that these are not official encoding names, but conventions to help clarify whether UTF-8 is being encoded with or without a BOM. Some Windows programs prefer one, while others do not.

encoding convert *.txt utf8n utf8b
converts *.txt files to UTF-8 with a BOM. In this case, both a source and target encoding are specified. Rather than detecting the source encoding, it is treated as UTF-8 without a BOM.
If the word 'backup' rather than 'convert' is used for the task, the original files will be backed up with the same names except for the addition of a .bak extension.
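
A conversion with an optional backup might look like the following Python 3 sketch. The function names codec_name and convert_file are hypothetical and do not come from Encoding.py. The sketch assumes that utf8b corresponds to Python's utf-8-sig codec and utf8n to the plain utf-8 codec, and it does not detect the source encoding.

```python
import shutil

def codec_name(name):
    """Map the utf8b/utf8n conventions to Python codec names (assumed mapping)."""
    key = name.lower().replace("-", "")  # the dash is optional, case is ignored
    return {"utf8b": "utf-8-sig", "utf8n": "utf-8"}.get(key, name)

def convert_file(path, source, target, backup=False):
    """Re-encode one file in place, optionally keeping a .bak copy first."""
    if backup:
        shutil.copy2(path, path + ".bak")  # e.g. test.txt -> test.txt.bak
    # Binary mode keeps line endings exactly as they are in the file.
    with open(path, "rb") as f:
        text = f.read().decode(codec_name(source))
    with open(path, "wb") as f:
        f.write(text.encode(codec_name(target)))
```

For example, convert_file("test.txt", "utf8n", "utf8b", backup=True) would prepend a BOM to test.txt and leave the original bytes in test.txt.bak.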

encoding url http://python.org
provides encoding information about the web page at that address. Encoding references are sought in the server response headers and metadata of the page. A conflict between encoding references is reported.
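
The comparison of encoding references can be sketched without any network access. The Python 3 example below is a rough illustration that uses only the standard library. It is not how the utility or the encutils package does it, and all function names are hypothetical.

```python
import re
from email.message import Message

def charset_from_header(content_type):
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()  # lowercased name, or None if absent

# Simplified pattern: the first charset=... found inside a <meta> tag.
META_RE = re.compile(rb"""<meta[^>]+charset\s*=\s*["']?([\w.:-]+)""", re.IGNORECASE)

def charset_from_meta(html_bytes):
    """Return the charset declared in a <meta> tag, or None if none is found."""
    match = META_RE.search(html_bytes)
    return match.group(1).decode("ascii").lower() if match else None

def compare(content_type, html_bytes):
    """Return (header charset, meta charset, whether the two conflict)."""
    header = charset_from_header(content_type)
    meta = charset_from_meta(html_bytes)
    conflict = header is not None and meta is not None and header != meta
    return header, meta, conflict

print(compare("text/html; charset=UTF-8", b'<meta charset="iso-8859-1">'))
# ('utf-8', 'iso-8859-1', True)
```

A real page would first have to be fetched, and a regular expression is only an approximation of HTML parsing, so this should be read as a sketch of the idea rather than a robust detector.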

encoding bytes *.txt
provides a list of numeric byte values, one per line, for all files matching the pattern. The first line is the file name. This is probably most useful when analyzing a single source file, and when redirecting standard output to another file that may be examined in an editor, e.g.,
encoding bytes test.txt >temp.txt

encoding chars temp.txt >test.txt
provides output in a similar form except that each line shows information about a character rather than a byte (Unicode can represent a character with multiple bytes). Each line has the Unicode name of the character, its numeric code point, and an ASCII equivalent of the character if available and different from the original character. For example, the ellipsis symbol has the code point U+2026 (8230 in decimal), and an ASCII equivalent of three consecutive periods (...), so it would appear as
HORIZONTAL ELLIPSIS 8230 ...

Add a SourceEncoding parameter to specify the file's encoding directly, rather than auto-detect it.
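
The name and code point columns of the chars output can be approximated with Python's standard unicodedata module. This is a Python 3 sketch with a hypothetical function name. The ASCII equivalent column is omitted because it depends on the third-party unidecode package.

```python
import unicodedata

def describe_chars(text):
    """Yield (Unicode name, decimal code point) for each character in text."""
    for ch in text:
        # Some characters, such as control characters, have no Unicode name.
        name = unicodedata.name(ch, "<unnamed>")
        yield name, ord(ch)

for name, code in describe_chars("a\u2026"):
    print(name, code)
# LATIN SMALL LETTER A 97
# HORIZONTAL ELLIPSIS 8230
```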

----------
Development Notes

The Encoding utility is developed with the Python 2.5 language from
http://python.org

The following built-in packages are used: codecs, glob, locale, os, shutil, sys, and unicodedata.

The following third-party packages are used:

chardet -- Universal encoding detector
http://chardet.feedparser.org

encutils -- Encoding detection collection for Python
http://cthedot.de/encutils/

py2exe -- Build standalone executables for Windows
http://py2exe.org

unidecode -- Unicode transliteration in Python
http://www.tablix.org/~avian/blog/archives/2009/01/unicode_transliteration_in_python/

The batch file, RunSetup.bat, runs the py2exe script, setup.py, to create the stand-alone executable, Encoding.exe.

I welcome feedback, suggestions, and code contributions, which will help this project improve over time.