Re: Announcing Encoding utility

Fascinating.  Learn something new every day.



On 8/19/10, DaShiell, Jude T.  CIV NAVAIR 1490, 1, 26
<jude.dashiell@xxxxxxxx> wrote:
> You may find such a package useful the first time you get handed a bunch
> of data files on which to work.  Confidence building measures can be to
> check encoding on all files to make sure the files are as described to
> you.  That way you know what tool or tools to use to best handle the
> files be it for searches or modification.  Just because a file has a
> .csv extension on it doesn't at all guarantee its contents are actually
> comma separated values; it could be an executable or actually html or
> any number of other formats.  In the Linux world, one of the lighter
> versions of the tools we have is called file.  You can run file prog.csv
> and get back information about prog.csv and know before going any
> further with the file.
>
> -----Original Message-----
> From: programmingblind-bounce@xxxxxxxxxxxxx
> [mailto:programmingblind-bounce@xxxxxxxxxxxxx] On Behalf Of Alex Midence
> Sent: Wednesday, August 18, 2010 22:36
> To: programmingblind@xxxxxxxxxxxxx
> Subject: Re: Announcing Encoding utility
>
> Interesting.  I can't personally conceive of when I would need such an
> app as of yet but, I'm new to developing so that may change.  It's
> still rather intriguing, however.
>
> Alex M
>
> On 8/18/10, Jamal Mazrui <empower@xxxxxxxxx> wrote:
>> Now available at
>> http://EmpowermentZone.com/Encoding.zip
>>
>> Encoding
>> Version 1.0
>> August 8, 2010
>> Copyright 2010 by Jamal Mazrui
>> GNU Lesser General Public License (LGPL)
>> ----------
>>
>> Contents
>>
>> Description
>> Installation
>> Operation
>> Development Notes
>> ----------
>>
>> Description
>>
>> Encoding is a free, open source, command-line utility for performing
>> encoding-related operations on files.  It can show the encoding of
> files,
>> and convert between different encodings.  Batch operations are
> supported
>> if wildcard characters are used in the file specification.  The
>> executable, Encoding.exe, should run on any version of Windows.  The
>> source code, Encoding.py, should run on other platforms as well.
>>
>> An encoding is an agreement about how to represent textual characters
> with
>> computer bytes.  Characters are encoded as byte sequences that may be
>> stored in disk files or computer memory.  A byte stream is decoded to
>> produce characters in a human language.  If a text file is not
> readable,
>> the reason may be that it has an encoding that was either not
> recognized
>> or not decoded properly.  This utility may help with such issues,
>> benefiting software developers or end users.  It works with over a
> hundred
>> character encodings.
>> ----------
>>
>> Installation
>>
>> Unarchive Encoding.zip into a directory, e.g., into
>> C:\Encoding
>>
>> Run Encoding.exe at a command prompt, e.g., one created by entering
>> cmd.exe
>>
>> at the Windows Start/Run prompt.
>>
>> Since Encoding is developed in a cross-platform language, Python, it
>> should also be possible to run the source code, Encoding.py, on other
>> platforms that have a Python interpreter.
>> ----------
>>
>> Operation
>>
>> The complete command-line syntax of Encoding is as follows:
>>
>> Encoding.exe TaskName FileSpec SourceEncoding TargetEncoding
>>
>> Some parameters are optional or not applicable depending on the name
> of
>> the task.  Typing the .exe extension is optional.  Capitalization does
> not
>> matter in task or encoding names .  The following tasks are supported,
>> illustrated with example parameter values:
>>
>> encoding help
>>
>> provides a help summary.  The help parameter is assumed if no other
> valid
>> task name is entered.
>>
>> encoding default
>>
>> provides the default language and encoding of the computer, e.g.,
>> en-us cp1252
>>
>> which means U.S. English using code page 1252.
>>
>> encoding show *.txt
>>
>> provides the encoding of all files meeting the *.txt specification.
> If a
>> file has a Unicode byte order mark (BOM), the encoding can be exactly
>> determined.  Otherwise, the encoding is huristically detected by
> analyzing
>> various factors.  This is the same algorithm used by the Firefox web
>> browser to detect the encoding of text.  It is usually correct, but
> not
>> always.
>>
>> encoding convert *.txt utf-8b
>>
>> converts all *.txt files to UTF-8 encoding with a BOM.  Use utf-8n to
> get
>> utf-8 without a BOM, which is the norm on Linux and the Mac.  For ease
> of
>> typing, the dash character (-) is optional, so utf8b or utf8n may be
> used
>> instead.  Note that these are not official encoding names, but
> conventions
>> to help clarify whether utf-8 is being encoded with or without a BOM.
>> Some Windows programs prefer one, while others do not.
>>
>> encode convert *.txt utf8n utf8b
>>
>> converts *.txt files to UTF8 with a BOM.  In this case, both a source
> and
>> target encoding are specified.  Rather than detecting the source
> encoding,
>> it is treated as UTF-8 without a BOM.
>>
>> If the word 'backup' rather than 'convert' is used for the task, the
>> original files will be backed up with the same names except for the
>> addition of a .bak extension.
>>
>> encode url http://python.org
>>
>> provides encoding information about the web page at that address.
>> Encoding references are sought in the server response headers and meta
>> data of the page.  A conflict between encoding references is reported.
>>
>> encoding bytes *.txt
>>
>> provides a list of numeric byte values, one per line, for all files
>> matching the pattern.  The first line is the file name.  This is
> probably
>> most useful when analyzing a single source file, and when redirecting
>> standard output to another file that may be examined in an editor,
> e.g.,
>> encoding bytes test.txt >temp.txt
>>
>> encoding chars temp.txt >test.txt
>>
>> provides output in a similar form except that each line shows
> information
>> about a character rather than a byte (Unicode can represent a
> character
>> with multiple bytes).  Each line has the Unicode name of the
> character,
>> its numeric code point, and an ASCII equivalent of the character if
>> available and different from the original character.  For example, the
>> ellipses symbol  has the code point U2026, and an ASCII equivalent of
>> three consecutive periods (...), so it would appear as
>> HORIZONTAL ELLIPSIS 8230 ...
>>
>>
>> Add a SourceEncoding parameter to specify the file's encoding
> directly,
>> rather than auto-detect it.
>> ----------
>>
>> Development Notes
>>
>> The Encoding utility is developed with the Python 2.5 language from
>> http://python.org
>>
>> The following built-in packages are used:  codecs, glob, locale, os,
>> shutil, sys, and unicodedata.
>>
>> The following third-party packages are used:
>>
>> chardet -- Universal encoding detector
>> http://chardet.feedparser.org
>>
>> encutils -- Encoding detection collection for Python
>> http://cthedot.de/encutils/
>>
>> py2exe -- Build standalone executables for Windows
>> http://py2exe.org
>>
>> unidecode -- Unicode transliteration in Python
>>
> http://www.tablix.org/~avian/blog/archives/2009/01/unicode_transliterati
> on_in_python/
>>
>> The batch file, RunSetup.bat, runs the py2exe script, setup.py, to
> create
>> the stand-alone executable, Encoding.exe.
>>
>> I welcome feedback, suggestions, and code contributions, which will
> help
>> this project improve over time.
>>
>> __________
>> View the list's information and change your settings at
>> http://www.freelists.org/list/programmingblind
>>
>>
> __________
> View the list's information and change your settings at
> http://www.freelists.org/list/programmingblind
>
> __________
> View the list's information and change your settings at
> http://www.freelists.org/list/programmingblind
>
>
__________
View the list's information and change your settings at 
http://www.freelists.org/list/programmingblind

Other related posts: