Announcing Encoding utility

From: Jamal Mazrui <empower@xxxxxxxxx>
To: ProgrammingBlind@xxxxxxxxxxxxx, Program-L@xxxxxxxxxxxxx, GUISpeak@xxxxxxxxxxxxx
Date: Wed, 18 Aug 2010 17:11:30 -0400 (EDT)

Now available at
http://EmpowermentZone.com/Encoding.zip

Encoding
Version 1.0
August 8, 2010
Copyright 2010 by Jamal Mazrui
GNU Lesser General Public License (LGPL)
----------

Contents

Description
Installation
Operation
Development Notes
----------

Description

Encoding is a free, open source, command-line utility for performing encoding-related operations on files. It can show the encoding of files and convert between different encodings. Batch operations are supported if wildcard characters are used in the file specification. The executable, Encoding.exe, should run on any version of Windows. The source code, Encoding.py, should run on other platforms as well.

An encoding is an agreement about how to represent textual characters with computer bytes. Characters are encoded as byte sequences that may be stored in disk files or computer memory. A byte stream is decoded to produce characters in a human language. If a text file is not readable, the reason may be that it has an encoding that was either not recognized or not decoded properly. This utility may help with such issues, benefiting software developers or end users. It works with over a hundred character encodings.
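
As a small illustration of the idea (written for Python 3, and not taken from Encoding.py), the same text becomes different bytes under different encodings, and decoding with the wrong one gives unreadable output:

```python
# The sample word is "cafe" with an accented e (U+00E9).
text = "caf\u00e9"

# Encoding turns characters into bytes; the result depends on the encoding.
utf8_bytes = text.encode("utf-8")      # the accented e takes two bytes
cp1252_bytes = text.encode("cp1252")   # the accented e takes one byte

# Decoding with the matching encoding recovers the original text.
round_trip = utf8_bytes.decode("utf-8")

# Decoding with the wrong encoding succeeds here, but the text is garbled.
garbled = utf8_bytes.decode("cp1252")

print(list(utf8_bytes))
print(list(cp1252_bytes))
print(round_trip == text, garbled == text)
```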

----------

Installation

Unarchive Encoding.zip into a directory, e.g., into
C:\Encoding

Run Encoding.exe at a command prompt, e.g., one created by entering
cmd.exe

at the Windows Start/Run prompt.

Since Encoding is developed in a cross-platform language, Python, it should also be possible to run the source code, Encoding.py, on other platforms that have a Python interpreter.

----------

Operation

The complete command-line syntax of Encoding is as follows:

Encoding.exe TaskName FileSpec SourceEncoding TargetEncoding

Some parameters are optional or not applicable depending on the name of the task. Typing the .exe extension is optional. Capitalization does not matter in task or encoding names. The following tasks are supported, illustrated with example parameter values:


encoding help

provides a help summary. The help parameter is assumed if no other valid task name is entered.


encoding default

provides the default language and encoding of the computer, e.g.,
en-us cp1252

which means U.S. English using code page 1252.
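
A rough sketch of how such values can be read with the built-in locale package in Python 3 follows. The function name is made up for illustration, and the exact calls and output format used by Encoding.py may differ:

```python
import locale

def default_language_and_encoding():
    """Return (language, encoding) as reported by the operating system.

    The language may be None if no locale has been configured.
    """
    language = locale.getlocale()[0]
    encoding = locale.getpreferredencoding(False)
    return language, encoding

language, encoding = default_language_and_encoding()
print(language, encoding)
```

The values printed depend on the machine, so they will not necessarily match the en-us cp1252 example above.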

encoding show *.txt

provides the encoding of all files meeting the *.txt specification. If a file has a Unicode byte order mark (BOM), the encoding can be exactly determined. Otherwise, the encoding is heuristically detected by analyzing various factors. This is the same algorithm used by the Firefox web browser to detect the encoding of text. It is usually correct, but not always.
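
The BOM check can be sketched with the constants in the built-in codecs package. This is an illustration in Python 3, not the code from Encoding.py, and the function name and returned encoding names are my own choices. When no BOM is found, a heuristic detector such as chardet (see Development Notes) would take over:

```python
import codecs

# Longer BOMs come first because the UTF-32 LE BOM starts with the
# same two bytes as the UTF-16 LE BOM.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def encoding_from_bom(data):
    """Return the encoding implied by a leading BOM, or None if there is none."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None

print(encoding_from_bom(b"\xef\xbb\xbfhello"))
print(encoding_from_bom(b"hello"))
```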


encoding convert *.txt utf-8b

converts all *.txt files to UTF-8 encoding with a BOM. Use utf-8n to get utf-8 without a BOM, which is the norm on Linux and the Mac. For ease of typing, the dash character (-) is optional, so utf8b or utf8n may be used instead. Note that these are not official encoding names, but conventions to help clarify whether utf-8 is being encoded with or without a BOM. Some Windows programs prefer one, while others do not.
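
In Python itself, the with-BOM and without-BOM variants correspond to the standard codec names utf-8-sig and utf-8. A minimal sketch of the conversion step in Python 3, with a function name and parameters chosen only for illustration, might look like this:

```python
import codecs

def to_utf8(data, source_encoding, with_bom):
    """Re-encode bytes as UTF-8, with or without a leading BOM."""
    # For UTF-8 input, utf-8-sig drops an existing BOM so it is not doubled.
    if codecs.lookup(source_encoding).name == "utf-8":
        source_encoding = "utf-8-sig"
    text = data.decode(source_encoding)
    return text.encode("utf-8-sig" if with_bom else "utf-8")

# cp1252 bytes converted to UTF-8 with a BOM (like utf-8b)
print(to_utf8(b"caf\xe9", "cp1252", True))
# UTF-8 with a BOM converted to UTF-8 without one (like utf-8n)
print(to_utf8(b"\xef\xbb\xbfhi", "utf-8", False))
```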


encoding convert *.txt utf8n utf8b

converts *.txt files to UTF-8 with a BOM. In this case, both a source and target encoding are specified. Rather than being detected, the source encoding is treated as UTF-8 without a BOM.

If the word 'backup' rather than 'convert' is used for the task, the original files will be backed up with the same names except for the addition of a .bak extension.
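
The backup step can be sketched with the built-in glob and shutil packages. This is an illustration in Python 3 with a made-up function name, and it assumes the .bak extension is appended to the full file name:

```python
import glob
import shutil

def backup_files(file_spec):
    """Copy each file matching file_spec to the same name plus .bak."""
    backups = []
    for path in glob.glob(file_spec):
        backup_path = path + ".bak"
        shutil.copy2(path, backup_path)  # copy2 also keeps timestamps
        backups.append(backup_path)
    return backups

# Example: back up every .txt file in the current directory
print(backup_files("*.txt"))
```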


encoding url http://python.org

provides encoding information about the web page at that address. Encoding references are sought in the server response headers and metadata of the page. A conflict between encoding references is reported.
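
The comparison can be sketched as follows in Python 3, using only built-in packages. The utility itself relies on encutils for this, so the function names, the regular expression, and the 4096-byte limit below are my own assumptions. Fetching the page is left out; the functions take a Content-Type header value and the raw page bytes:

```python
import re
from email.message import Message

# Matches charset=... inside a <meta> tag, in either of the common forms.
META_CHARSET = re.compile(
    rb"""<meta[^>]*charset\s*=\s*["']?([A-Za-z0-9_.:-]+)""", re.IGNORECASE)

def charset_from_header(content_type):
    """Return the charset parameter of a Content-Type value, or None."""
    message = Message()
    message["Content-Type"] = content_type
    return message.get_content_charset()

def charset_from_meta(html_bytes):
    """Return the charset declared in a <meta> tag near the top, or None."""
    match = META_CHARSET.search(html_bytes[:4096])
    return match.group(1).decode("ascii").lower() if match else None

def encoding_report(content_type, html_bytes):
    """Return (header charset, meta charset, whether the two conflict)."""
    header = charset_from_header(content_type)
    meta = charset_from_meta(html_bytes)
    conflict = header is not None and meta is not None and header != meta
    return header, meta, conflict

print(encoding_report("text/html; charset=UTF-8",
                      b'<meta charset="iso-8859-1">'))
```

This simple comparison treats aliases such as utf8 and utf-8 as different names, so a fuller version would normalize them first.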


encoding bytes *.txt

provides a list of numeric byte values, one per line, for all files matching the pattern. The first line is the file name. This is probably most useful when analyzing a single source file, and when redirecting standard output to another file that may be examined in an editor, e.g.,

encoding bytes test.txt >temp.txt
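
The kind of listing produced by the bytes task can be sketched in Python 3 as below. The function name is made up, and showing the values in decimal is an assumption on my part:

```python
def byte_lines(path):
    """Return the file name followed by one numeric byte value per line."""
    with open(path, "rb") as handle:
        data = handle.read()
    # Iterating over bytes in Python 3 yields integers from 0 to 255.
    return [path] + [str(value) for value in data]

# Example: print("\n".join(byte_lines("test.txt")))
```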

encoding chars temp.txt >test.txt

provides output in a similar form except that each line shows information about a character rather than a byte (Unicode can represent a character with multiple bytes). Each line has the Unicode name of the character, its numeric code point, and an ASCII equivalent of the character if available and different from the original character. For example, the ellipsis symbol has the code point U+2026 (8230 in decimal), and an ASCII equivalent of three consecutive periods (...), so it would appear as

HORIZONTAL ELLIPSIS 8230 ...

Add a SourceEncoding parameter to specify the file's encoding directly, rather than auto-detect it.
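
The name and code point columns of the chars output can be sketched with the built-in unicodedata package in Python 3. The function name and the UNKNOWN placeholder are my own choices. The ASCII equivalent column is omitted here because it depends on the third-party unidecode package listed in the Development Notes:

```python
import unicodedata

def char_lines(text):
    """Return one line per character: its Unicode name and decimal code point."""
    lines = []
    for char in text:
        # Control characters such as newline have no Unicode name.
        name = unicodedata.name(char, "UNKNOWN")
        lines.append("%s %d" % (name, ord(char)))
    return lines

print("\n".join(char_lines("A\u2026")))
```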

----------

Development Notes

The Encoding utility is developed with the Python 2.5 language from
http://python.org

The following built-in packages are used: codecs, glob, locale, os, shutil, sys, and unicodedata.


The following third-party packages are used:

chardet -- Universal encoding detector
http://chardet.feedparser.org

encutils -- Encoding detection collection for Python
http://cthedot.de/encutils/

py2exe -- Build standalone executables for Windows
http://py2exe.org

unidecode -- Unicode transliteration in Python
http://www.tablix.org/~avian/blog/archives/2009/01/unicode_transliteration_in_python/

The batch file, RunSetup.bat, runs the py2exe script, setup.py, to create the stand-alone executable, Encoding.exe.

I welcome feedback, suggestions, and code contributions, which will help this project improve over time.


