[slikom] PDF

  • From: "Gradimir Kragic" <bastono@xxxxxxxx>
  • To: "\"SliKom\"" <slikom@xxxxxxxxxxxxx>
  • Date: Sat, 13 May 2006 03:11:45 +0200


     Hello everyone,

     Over the past few days there have been several messages about PDF documents
and how to read them.
Here is a text that covers everything one should know about PDF documents.
The author of the text is Jamal Mazrui, who is also the author of the PDF2TXT
program, which you can download from the link you found on the SliKom web
portal. The text is superbly written and explains things simply. The catch is
that the text is in English, so it may not be of much use to many of you. It
certainly will be to the programmers on this list.

Gradimir

P.S. You cannot respond to this message using Reply. The text is larger than
32 KB, so if you add anything to it, it will not get through the list because of
the limit, which is set at 32,384 KB. Anyone who wants to comment should write a
new message.



What's in a PDF? The Challenges of the Popular Portable Document Format

Author: Jamal Mazrui

Portable Document Format (PDF) is an electronic file format developed by Adobe
Systems of San Jose, California. PDF has become one of the most popular file
formats for publishing documents on the Web and is thus a common medium for the
dissemination of knowledge. This article identifies features behind the
popularity of PDF, analyzes their impact on accessibility, and discusses the use
of the Adobe Reader program with a screen reader, such as JAWS or Window-Eyes.

Popular Features

Adobe publishes an official specification of PDF, which has evolved over the
years to version 1.6 at present. Compared to other formats that can be used
for storing and distributing documents electronically, such as HTML or Microsoft
Word, PDF is distinguished by at least four features: visual fidelity,
compact storage, security settings, and cross-platform portability.

Visual Fidelity

By preparing a document in PDF, one can be reasonably confident that the precise
visual appearance that is intended is presented to the reader, including
layout, fonts, colors, and pictures. This is true whether the output is
displayed on the computer screen or printed as hard copy. Since a PDF file is
internally divided into pages of output, each page of an author's work will
have the look
and feel that he or she wants to convey. This visual fidelity is a reason
why PDF is widely used for distributing publications in electronic form.

Compact Storage

A document in HTML format is typically divided into multiple files that are
presented as separate pages on a web site. Moreover, pictures are further
separated as graphics files that are linked to the text pages. Thus, distributing a
document in HTML usually involves collecting various files at the source and
placing them in an appropriate arrangement at the destination for the document
to be coherent.

If a document is prepared in PDF, on the other hand, all the text and graphics
are bound in a single file. In addition, this file is compressed: Techniques
are used for storing repeating sequences of data in more compact ways, thus
reducing the total size. The software for viewing a PDF file automatically
decompresses the data as it presents its content in readable form. This compact
storage means that a web site can store publications in a single file that
corresponds to each document, a user can download them faster, and both sending
and receiving are easier.
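
The effect of storing repeating sequences more compactly can be seen in a
minimal Python sketch. It uses the standard zlib module, which implements the
DEFLATE method that PDF's Flate filter is also based on. The sample text and
the variable names are made up for illustration and are not taken from any
real PDF file.

```python
import zlib

# Hypothetical page content: highly repetitive text compresses well.
page_text = ("Portable Document Format " * 200).encode("ascii")

# What would be stored in the file.
compressed = zlib.compress(page_text)

# What a viewing program does when it opens the file.
restored = zlib.decompress(compressed)

print(len(page_text), "bytes before,", len(compressed), "bytes after")
```

The restored bytes are identical to the original, so nothing is lost by the
compact storage. How much is saved depends on how repetitive the data is.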

Security Settings

PDF contains optional settings that an author can incorporate to limit how a PDF
file is used. Without such restrictions, the Adobe Reader program permits
a user to view a PDF file on the screen, print it, copy it to the clipboard, and
save it to disk in plain text format. With security settings, however,
any of the uses besides on-screen viewing may be blocked completely or limited
in some way. For example, only a portion may be copied to the clipboard
or only a range of pages may be printed once a week. Stricter settings can
prevent a PDF file from being viewed on any computer that does not contain a
license key for a specific PDF file. The mechanism is similar to those that are
sometimes used to prevent unauthorized copying of software to other computers.
These security settings mean that authors can choose to limit who uses their
documents and how.

Cross-platform Portability

An integral piece of PDF support is the free software that Adobe also develops
for viewing PDF files on several different computer platforms or operating
systems, including Microsoft Windows, Apple Macintosh, UNIX, and handheld
personal digital assistants. The Adobe Reader program ensures that a PDF file
can be viewed with the same visual fidelity on almost any type of computer.
Since these programs may be obtained without charge, the cost of the Adobe
Reader software is not an obstacle to viewing a document that is available in
PDF. This cross-platform portability means that authors can disseminate their
works widely.

Accessibility Challenges

The popularity of PDF as a means of distributing publications has some benefit
for people who are blind or have impaired vision. In general, electronic
publications offer more potential for accessible, independent reading than do
print publications, since computer programs can produce output in flexible
and alternative ways, including synthetic speech, braille, and magnified text.
This means that an intermediary sighted assistant is not needed, thus providing
convenience and privacy. The benefits of PDF, previously discussed, help to
increase the amount of reading material that is published in electronic form.
In addition, someone who is visually impaired benefits directly, as others do,
from particular PDF features, such as compact storage.

Yet, some PDF features that provide benefits of a general nature have had
inadvertent adverse side effects for nonvisual readers. To understand why, this
section explains some technical inner workings of PDF. The specification for the
current version 1.6 is over 1,200 pages long. To keep within the scope
of this article, the discussion will necessarily simplify a technical
explanation of the format, focusing on the concepts most relevant to
accessibility.

The PostScript Language

PDF originates in a specialized programming language, called PostScript,
developed by Adobe in the 1980s. Part of the power of PostScript derives from its
flexibility about the order in which parts of output are placed on a page. The
order does not have to be from left to right and top to bottom. A PostScript-enabled
printer produces output a page at a time. Each page of output is transmitted as
a batch after all drawing operations on it are complete. An observer of
the visual page may guess, but does not actually know, in what order the output
was drawn.

Three Components of Output

Producing output may be subdivided into drawing three components: textual
characters, vector graphics, and photographic images. How these different
objects are used and combined has implications for accessibility, as explained later.

Textual Characters

Textual characters are based on a font table: a set of associations between the
visible form of a character and its numeric value in a system called Unicode.
The historically popular code called ASCII (American Standard Code for
Information Interchange) defines 128 characters, or 256 in its common 8-bit
extensions, which typically suffice for expressing English and other Western
European languages. Unicode, by
comparison, defines tens of thousands of characters in order to support numerous
written languages of the world, as well as many specialized symbols used in
particular subject areas. A PostScript program draws a string of characters
on a page by using the Unicode value of each character and looking up its
associated shape in a font table.
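
As a rough illustration of the lookup just described, the following Python
sketch models a font table as a dictionary from Unicode values to shapes. The
table contents, the shape names, and the function draw_string are hypothetical
placeholders. A real font table holds outline data, not strings.

```python
# Hypothetical font table: Unicode value -> name of the shape to draw.
FONT_TABLE = {
    ord("P"): "shape_P",
    ord("D"): "shape_D",
    ord("F"): "shape_F",
}


def draw_string(text, font_table):
    """Return the shapes that would be drawn for each character of text."""
    return [font_table[ord(ch)] for ch in text]


shapes = draw_string("PDF", FONT_TABLE)
print(shapes)
```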

Vector Graphics

Besides textual characters, many other kinds of shapes may be drawn on a page
based on mathematical calculations. Such shapes--called vector graphics--may
be straight or curved lines, geometric designs such as circles or squares, or
filled areas according to a pattern. In fact, PostScript can draw vector
graphics to create a picture of almost anything on a page.

Photographic Images

A third component of output is a photographic image, which may be thought of as
an array of colored dots that create a literal picture. PostScript does
not know the internal structure of an image, so it essentially copies rather
than generates it to a particular location on the page. Such images are typically
defined in a format called TIFF (Tagged Image File Format).

The PDF File Type

Adobe built PDF as a file type on the foundation of PostScript as a printing
language. PDF is a way that documents can be viewed on the screen and exchanged
among users, not just printed onto paper. PDF uses the same "imaging model" as
PostScript for describing how a page looks. A PDF file contains an abbreviated
set of PostScript instructions: basically, a sequence of drawing operations
without other programming constructs such as conditions and loops.

Hence, a PDF document is a file that contains PostScript instructions and the
data they use. The commands and data follow certain rules that Adobe has defined
as the specification for Portable Document Format. As opposed to a file format
whose internal structure is only known by its developers, the PDF specification
is published and open rather than private and proprietary. It is copyrighted and
controlled by Adobe, but anyone is free to use it for developing software
that either creates or views PDF files within general licensing terms. Adobe
also publishes a free viewing and printing program for many different devices
so that all understand PDF in the same way. Adobe has, therefore, established
the combination of a file format and software interpreter that enables authors
to publish documents with a certain look and feel for potential readers in a
broad variety of environments.

Three Types of PDF Files

PDF files may be subdivided into three types: image-only, searchable image, and
formatted text and graphics. These types differ in their use of the different
components just described--textual characters, vector graphics, and photographic
images.

Image-Only PDF

An image-only PDF contains a photographic image representing each page, and
virtually no textual characters or vector graphics. Although text may appear
on a page, the text is actually a surface picture without underlying characters.
Individual characters are needed for translation into speech or braille,
so an image-only PDF file is inaccessible.

Image-only PDF files are usually created by scanning hard-copy documents into a
computer with attached scanning equipment. Essentially, the system takes
a picture of each printed page and then packages the pages in a PDF file. It is
possible to use optical character recognition (OCR) software to create
textual characters in the PDF file, but this is often not done because the
process takes much longer: minutes for OCR compared to seconds for photographic
snapshots. Another reason for avoiding OCR is that the resulting text usually
contains recognition errors that require manual proofreading and correction
to be accurate, thereby involving more staff time and skill.

Scanning documents into image-only PDF files has been a common way of storing
information for archival purposes because electronic media are much smaller
and less cumbersome than paper storage. The more that documents originate in
electronic, rather than hard-copy, form, the less likely it is that they
need to be scanned to be archived. Thus, as authors rely more on computers as
the original source of documents, the accessibility problem of image-based
PDF may lessen over time.

Searchable Image

Searchable-image PDF also contains an image for each page, but this type
includes a text layer as well. The textual characters are produced from an OCR
process, which analyzes each image for what appear to be characters. Wherever
characters are recognized in the image, the software draws a layer of text
under them. An observer of the page sees the surface image only, as with
image-only PDF.

The text layer enables a PDF file to be searched for phrases of interest to a
reader who is viewing the document. This text also enables PDF files to be
indexed with keywords in a collection of electronic documents, thus permitting a
researcher to find particular ones worth further study.

Adding a text layer increases the size of a PDF file, so text may be omitted if
compactness is of primary importance. Usually, however, the ability to search,
for sighted as well as visually impaired readers, outweighs the cost in extra
size, especially since the text is compressed, as previously mentioned. Since
nonvisual access to PDF content requires text, adding searchability to a PDF
file also benefits accessibility.

Formatted Text and Graphics

A third PDF type, called formatted text and graphics, minimizes the use of
photographic images in favor of textual characters and vector graphics. No image
layer rests on top of a text layer. Instead, textual characters and vector
graphics are drawn wherever they can represent the content of a page.
Photographic images are used only when they are pictures that cannot be generated from
building blocks of textual characters and vector graphics. This type of PDF is
usually the result of conversion from another electronic file format, such as
Microsoft Word. This type is the most compact (often 10% of an image-only
file with the same content). Also, since this type is built from more structured
components, it may be used more flexibly for other purposes. For example,
such a PDF file might be converted to HTML for display as web pages or converted
to Microsoft Word for editing as part of another document.

A PDF file composed as formatted text and graphics is likely to be more
accessible than one composed as searchable image. Although both types contain
textual characters, the quality of the text is almost necessarily better in the
former type because it serves the purpose of presentation as well as searchability.
If the PDF file was created by scanning, more work has probably been done than
with the searchable-image type in order to correct OCR errors and achieve
presentable text. If the PDF file was created by converting another electronic
format, then the textual components are probably more complete, since they
derive directly from character fonts rather than indirectly from recognized
images. Despite the accessibility potential of this PDF type, however,
structural issues may still pose significant accessibility problems, as
subsequently explained.
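
The three types described in this section can be told apart, very roughly, by
which components a page contains. The Python sketch below is only a simplified
illustration. The function classify_page and its two inputs are hypothetical
and do not correspond to any Adobe tool or to fields in the PDF specification.

```python
def classify_page(has_page_image, text_char_count):
    """Guess the PDF type of a page from two simplified, hypothetical inputs.

    has_page_image: True if one photographic image covers the whole page.
    text_char_count: number of textual characters stored for the page.
    """
    if has_page_image and text_char_count == 0:
        return "image-only"
    if has_page_image:
        # A page picture with a text layer drawn under it.
        return "searchable image"
    # No full-page picture: content is built from text and vector graphics.
    return "formatted text and graphics"


print(classify_page(True, 0))
print(classify_page(True, 1800))
print(classify_page(False, 1800))
```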

Character Decoding

Textual characters are a necessary condition for the accessibility of PDF, but
they are not sufficient on their own. Some PDF-creation tools do not leave
enough information about the fonts used for a PDF viewing program to decipher
all the characters in terms of a well-understood computer alphabet. The viewing
program sees shapes that it knows are characters drawn on the page. The program
then has to do a back-translation of their drawing operations, looking
up the Unicode value for each shape and rendering it as a standard screen
character. If the original font table is embedded in the PDF file, the viewing
program can decode the characters. Decoding is also possible if a common font
was used, such as one built into the operating system. Without an available
font table, however, the viewing program does not know what textual characters
exist because it does quick table lookups rather than sophisticated OCR.
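
The back-translation can be sketched in Python as a reverse table lookup. This
is an illustration under simplifying assumptions: the shape names, the table,
and the function decode_shapes are hypothetical, and undecodable shapes are
shown here as the Unicode replacement character.

```python
UNKNOWN = "\ufffd"  # Unicode replacement character for undecodable shapes


def decode_shapes(shapes, font_table):
    """Back-translate drawn shapes to text using a Unicode value -> shape table."""
    shape_to_code = {shape: code for code, shape in font_table.items()}
    return "".join(
        chr(shape_to_code[s]) if s in shape_to_code else UNKNOWN
        for s in shapes
    )


# Hypothetical embedded font table and the shapes drawn on a page.
embedded_table = {ord("P"): "shape_P", ord("D"): "shape_D", ord("F"): "shape_F"}
drawn = ["shape_P", "shape_D", "shape_F"]

with_table = decode_shapes(drawn, embedded_table)   # table available
without_table = decode_shapes(drawn, {})            # no table available
```

With the table, the shapes decode to readable text. Without it, every shape
is unknown, which is the situation a screen reader user encounters as
unreadable content.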

Reading Order

Even if complete character decoding is possible, a PDF file may be inaccessible
because of problems in "reading order." This term refers to the order of
words, sentences, and paragraphs. Can they be extracted from the text of the PDF
file in a coherent, linear order, or are they mixed together in disconnected,
confusing ways?

For example, the text of a PDF file may appear visually like newspaper columns,
where a line stops midway across the page and continues underneath, rather
than continuing across to the right margin. Visually, on a screen or printout,
the structure of the document is apparent because of extra spacing or a
border line that indicates where one column of text ends and another begins.
Information about this document structure, however, must be represented in
the PDF file for the reading order to be rendered in an intelligible manner by
assistive technology. Without structural information that groups and separates
regions of the page, the document may be inaccessible to nonvisual readers.

Since PDF is frequently chosen for publications that are intended to look
fancier than single-column text, PDF files often contain irregular page layouts
with multiple columns, sidebars, and picture captions. If these files lack an
internal structure, a nonvisual interpretation of them necessarily involves
guesses about reading order, and mistakes can seriously undermine the
comprehension of their content.
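
The two-column problem can be made concrete with a small Python sketch. The
fragment coordinates, the column boundary, and both functions are assumptions
made for illustration. The point is only that sorting text by position on the
page, without knowing where the columns are, interleaves the columns.

```python
# Each fragment is (x, y, text); y grows downward. Values are hypothetical.
fragments = [
    (50, 100, "Left column, line 1"),
    (300, 100, "Right column, line 1"),
    (50, 120, "Left column, line 2"),
    (300, 120, "Right column, line 2"),
]


def naive_order(frags):
    """Read straight across the page: top to bottom, then left to right."""
    return [text for _, _, text in sorted(frags, key=lambda f: (f[1], f[0]))]


def column_order(frags, column_boundary):
    """Read the left column fully, then the right column."""
    left = sorted((f for f in frags if f[0] < column_boundary), key=lambda f: f[1])
    right = sorted((f for f in frags if f[0] >= column_boundary), key=lambda f: f[1])
    return [text for _, _, text in left + right]


print(naive_order(fragments))
print(column_order(fragments, column_boundary=275))
```

The second function needs the column boundary as input. That value stands in
for the structural information that must come from the PDF file or be guessed.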

Accessibility Options

Tagging PDF Files

To address such accessibility problems, Adobe introduced an extension to PDF
called "tagging." The concept is similar to tags in the HTML format. As
background, the World Wide Web Consortium (W3C) did pioneering work with HTML tags to
incorporate the document structure that was needed for accessibility as the HTML
standard evolved.

HTML encloses portions of text with markers that indicate the structure or
purpose of the text. For example, a phrase may be tagged as the heading of a
section, the caption of an image, or a cell within a table. Some tags are
necessary for proper visual display in a web browser that interprets HTML files,
whereas other tags--although still a standard part of the HTML language--are
recommended specifically to aid accessibility. For example, accessibility
tags include an indication of the row and column labels of a table, which
enables a screen reader to tell the user about the context of each cell. The
cell information may be useless or confusing without knowing the associated row
and column labels. Collectively, the HTML tags that are needed for accessibility
are sometimes called "accessible markup."
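
To show why the labels matter, here is a minimal Python sketch of how a cell
might be announced together with its row and column labels. The table data and
the function announce_cell are invented for illustration and do not reproduce
the behavior of any particular screen reader.

```python
# Hypothetical table with column labels and labeled rows.
table = {
    "columns": ["Q1", "Q2"],
    "rows": {"Sales": [10, 12], "Costs": [7, 8]},
}


def announce_cell(table, row_label, col_index):
    """Build the phrase a screen reader could speak for one cell."""
    col_label = table["columns"][col_index]
    value = table["rows"][row_label][col_index]
    return f"{row_label}, {col_label}: {value}"


print(announce_cell(table, "Costs", 1))
```

Without the labels, the user would hear only the bare value, with no way to
tell which row or column it belongs to.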

The tagged PDF that Adobe developed provides similar functionality. Tags mark
portions of PDF content and are organized in a sequence that conveys the
suggested reading order. Whereas HTML files are readable text with tags as words enclosed
in brackets, however, PDF files are in a compressed, binary form with tags
that can be viewed only with special software, such as Adobe Acrobat.
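
As a conceptual sketch only, the idea of tags organized in a sequence can be
modeled in Python as a small tree that is walked from top to bottom. This is a
simplified in-memory model, not the actual syntax stored in a PDF file. The
sample content and the function reading_order are hypothetical.

```python
# A node is (tag, content); content is either text or a list of child nodes.
tree = ("Document", [
    ("H1", "Annual Report"),
    ("P", "First paragraph."),
    ("Figure", "Bar chart of sales by quarter"),
    ("P", "Second paragraph."),
])


def reading_order(node):
    """Walk the tree depth-first and return (tag, text) pairs in order."""
    tag, content = node
    if isinstance(content, str):
        return [(tag, content)]
    result = []
    for child in content:
        result.extend(reading_order(child))
    return result


for tag, text in reading_order(tree):
    print(f"{tag}: {text}")
```

Because the order comes from the tree and not from positions on the page,
assistive technology does not have to guess it.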

Accessibility Standards and Incentives

The W3C has defined standards for accessible markup, called the "Web Content
Accessibility Guidelines" (WCAG 1.0). The U.S. government has also defined
accessibility standards for web sites, software, and other information
technology in regulations that were first published in 2001 to implement Section
508 of the Rehabilitation Act, as amended. (See For More Information at the end
of this article for a link to these regulations.) Section 508 mandates
that federal agencies provide information to people with disabilities in a
manner that is comparable to that provided to people without disabilities.

Section 508 does not require software manufacturers to make accessible products,
but it does provide them with significant market incentives to do so because
the federal government is a large customer that is interested in products that
meet minimum standards of accessibility. Indeed, Congress adopted Section
508 partly with the stated purpose of creating voluntary market incentives to
develop technologies that benefit people across a broad range of physical
characteristics, not just those with typical levels of eyesight, hearing, manual
dexterity, and other traits.

Adobe, like other companies that sell to the federal government, has noticeably
increased the accessibility of its products in recent years, and its web
site includes information on compliance with Section 508 standards. The tagged
PDF format is an accessibility innovation that the company introduced in
2001. Besides the free program for viewing PDF files, called Adobe Reader, Adobe
sells a commercial program for creating PDF files, including tagged PDF
files, called Adobe Acrobat. The program is available in both a Standard and
Professional version, with the latter having the most tagging features and
being recommended by Adobe to customers who are concerned with accessibility.

Adobe Acrobat

The basic content and layout of a PDF document is usually created and revised
using a word-processing program, such as Microsoft Word or Corel WordPerfect,
and is then converted to PDF to create the final form, exploiting features like
visual fidelity, compact storage, security settings, and cross-platform
portability, as previously described. Adobe Acrobat enables one to convert a
document into PDF from other formats, including plain text, HTML, and popular
word-processing programs. It lets one combine multiple source documents into a
single PDF file, such as a report consisting of a Microsoft Word narrative
and a Microsoft Excel spreadsheet. It then allows the author or designer to
touch up the appearance for the precise presentation that is desired.

Adobe Acrobat includes a feature that analyzes the accessibility of a PDF file.
It reports potential problems, such as characters that are unidentifiable,
structure that is ambiguous, or pictures that are unlabeled. A related feature
adds tags when this can be done with a high degree of certainty about what
markup is appropriate in the context of the document. For example, it may
associate each page footer with a corresponding tag when the analysis finds
significant space between the rest of the page and the last line of text and that line
contains a page number.
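
The footer example can be written as a simple rule. The Python sketch below is
a guess at the general idea, not Adobe Acrobat's actual analysis. The function
looks_like_footer, its inputs, and the gap threshold are assumptions, and it
treats a line as a page number only when the line consists entirely of digits.

```python
def looks_like_footer(lines, min_gap):
    """Decide whether the last line of a page looks like a page-number footer.

    lines: list of (y, text) pairs ordered top to bottom; y grows downward.
    min_gap: smallest vertical gap that counts as "significant space".
    """
    if len(lines) < 2:
        return False
    (prev_y, _), (last_y, last_text) = lines[-2], lines[-1]
    is_page_number = last_text.strip().isdigit()
    is_set_apart = (last_y - prev_y) >= min_gap
    return is_page_number and is_set_apart


# Hypothetical page: two body lines, then "12" far below them.
page = [(100, "Body text"), (120, "More body text"), (700, "12")]
print(looks_like_footer(page, min_gap=50))
```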

Adobe Acrobat cannot tell what a picture contains, so an author needs to enter a
caption tag for the picture manually. Tables also present a challenge.
Does the left column of the table consist of labels for the rows to the right,
or does it consist of actual data in a table with column labels but no row
labels?

The accessibility report that is produced by Acrobat identifies potential
problems that one typically corrects by selecting a portion of the document and
picking a tag to indicate its purpose. This manual tagging process may involve
significant time and skill, depending on the complexity of the document.

Using Adobe Reader

Adobe and Screen Readers

Assistive technology companies, such as Freedom Scientific, the developer of
JAWS, and GW Micro, the developer of Window-Eyes, have worked with Adobe to
make their screen readers understand the tags of a PDF file that is viewed in
Adobe Reader (or Acrobat) and thereby render more accessible output in speech
or braille. At the time of this writing, the latest release of Adobe Reader is
version 7.0.3, which requires Windows 2000 or XP. When Adobe Reader is launched,
it detects whether a screen reader is running. If so, it presents a dialog box
of configuration options that affect accessibility and sets the default
choices to ones that Adobe Reader finds are the most likely to work best.

The most significant accessibility setting is called "infer reading order from
document." With this setting active, Adobe Reader will analyze an untagged
PDF file and add temporary tags to optimize its reading order. The analysis
examines spacing between blocks of text, for example, to decide whether there
are multiple columns of information.
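
One way such a spacing analysis could work is to look for a wide horizontal gap
that no block of text crosses. The Python sketch below illustrates that idea
under stated assumptions. It is not the algorithm Adobe Reader uses, and the
function find_column_gaps, the block coordinates, and the threshold are
hypothetical.

```python
def find_column_gaps(blocks, min_gap):
    """Find horizontal gaps between text blocks that may separate columns.

    blocks: list of (x_start, x_end) extents of text blocks on one page.
    min_gap: smallest gap width that counts as a column separator.
    Returns a list of (gap_start, gap_end) pairs, left to right.
    """
    if not blocks:
        return []
    intervals = sorted(blocks)
    gaps = []
    covered_until = intervals[0][1]
    for start, end in intervals[1:]:
        if start - covered_until >= min_gap:
            gaps.append((covered_until, start))
        covered_until = max(covered_until, end)
    return gaps


# Hypothetical page with two columns of text blocks.
blocks = [(50, 250), (50, 240), (300, 500), (310, 500)]
print(find_column_gaps(blocks, min_gap=30))
```

A single wide gap suggests two columns. A page with no gap of the required
width would be read as a single column.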

Although the automatic tagging process is beneficial for reading order, it has
three drawbacks. First, with a large PDF file, containing more than 50 pages,
the process may take a few minutes or more to complete, depending on the
complexity of the document and the speed of the computer. Second, one may not
be able to work with other programs while a document is being tagged because the
tagging process may slow the other programs to an unusable crawl. Third,
the tagging process does not signal when it is complete, so one has to keep
checking with the screen reader to determine whether the file is ready for
reading.

Because of the drawbacks of automatic tagging, Adobe Reader asks the user to
confirm whether to add tags before initiating the process each time it opens
a file. The user will usually want the tagging for better reading order. If the
extra confirmation step seems inefficient or annoying, however, one can
turn it off. The downside is that the computer will then become unusable for a
few minutes whenever a large PDF file is opened and automatic tagging occurs
for the whole file. This tagging process occurs even if the same file has been
opened before--such tags are temporary and not saved by Adobe Reader from
one session to another.

If the confirmation setting is on and the user declines to add tags to the whole
file up front, the user can still read a large PDF file a page at a time.
Whenever the user navigates to a new page, however, there is a pause of a few
seconds while Adobe Reader adds temporary tags for that page and communicates
them to the screen reader.

The many configuration settings of Adobe Reader are located in the Preferences
dialog box under the Edit menu. A hot key for this dialog box is Control-K.
Users of JAWS versions prior to 6.1 should note that pressing its bypass key,
Insert-3, may be necessary before pressing Control-K because JAWS uses Control-K
for other purposes.

Accessibility-related settings of Adobe Reader are located primarily in two tab
pages of the Preferences dialog box, those named Accessibility and Reading.
Adobe Reader also groups most accessibility settings in another dialog box,
however, called the Accessibility Setup Assistant, which is a choice on the
Help menu. This convenient dialog box lets you configure screen-reader settings,
screen-magnifier settings, or both. It lets you either accept all recommended
settings or customize settings through a series of wizard pages. It is suggested
that you accept all recommended settings initially and then explore possible
modifications later if your results are unsatisfactory.

Since screen reader users rely on common hot keys, rather than pointing and
clicking with the mouse, an application may be more challenging if it involves
nonstandard keystrokes. This is partly true of the screen reader interface to
Adobe Reader. For example, one has to learn that Control-Shift-PageUp, rather
than Control-Home, goes to the top of the document. Configuration options are on
the Edit menu, rather than on the View or Tools menu. Some unconventional
interface elements may exist because Adobe makes versions of its Reader software
for several operating systems and so may trade some Windows conventions for
cross-platform consistency.

The problem of unconventional interface, however, is also due to screen reader
adjustments made to accommodate the two different tag modes available: single
page or whole document. Using the example above, Control-Home is, in fact, the
hot key for going to the top of a document in Adobe Reader, just like other
Windows programs. When a screen reader is running, however, it uses Control-Home
to go to either the top of document or top of page, depending on whether
document or page mode is active. Therefore, Control-Shift-PageUp is implemented
as a way to always go to the top of document.

Useful Hot Keys

Some nonstandard but useful hot keys of Adobe Reader are as follows:

  • Control-PageDown or Control-PageUp: Go to the next or previous page
  • Control-Shift-PageDown or Control-Shift-PageUp: Go to the bottom or top of
    the document
  • Control-K: Go to the Preferences dialog box
  • Control-D: Display document properties, including security settings and
    tagged status that affect accessibility
  • Control-Shift-6: Check for accessible reading order
  • Alt-F then V: Save to text
  • Alt-H then T: Accessibility Setup Assistant

JAWS vs. Window-Eyes

Accessibility comparisons between JAWS and Window-Eyes are often challenging to
make because each program may adopt and add to features that the other started
six months before. Both companies claim to provide support for Adobe Reader that
is comparable to their support for Internet Explorer. With JAWS 6.20 and
Window-Eyes 5.0, we observed progress toward this end.

The table navigation commands of JAWS, which previously worked with web pages in
Internet Explorer, now also work with PDF files in Adobe Reader. The Adobe
Reader Find command, invoked with Control-F, does not work with JAWS. It does
work with Window-Eyes, but after a noticeable delay. Both screen readers,
however, have implemented alternate Find commands that work better:
Control-Insert-F using JAWS or Control-Shift-F using Window-Eyes. Neither screen
reader fully identifies security settings in the Document Properties window without
requiring navigation of the window using mouse-simulation keys.

In general, both screen readers are sluggish in Adobe Reader, enough so that we
sometimes felt frustrated by the experience of inefficiency (when run under
Windows 2000 on a Pentium 4 computer at 1.9 GHz).

The Bottom Line

PDF files are widespread, and people who are blind or have impaired vision need
access to them. Although the original format made accessibility difficult,
the newer, tagged format holds promise, and recent versions of Adobe Reader work
better with screen readers.

For More Information

Adobe Systems accessibility page: <www.adobe.com/accessibility>.

Adobe page on Section 508 compliance:
<www.adobe.com/enterprise/accessibility/section508.html>.

Adobe Reader download page: <www.adobe.com/products/acrobat/alternate.html>.

Using Accessible PDF Documents with Adobe Reader 7.0: A Guide for People with
Disabilities: <www.adobe.com/enterprise/accessibility/reader/main.html>.

Online PDF conversion tool:
<www.adobe.com/products/acrobat/access_onlinetools.html>.

Creating Accessible PDF Documents with Adobe Acrobat 7.0:
<www.adobe.com/enterprise/accessibility/pdfs/acro7_pg_ue.pdf>.

Web content accessibility guidelines:
<www.w3.org/WAI/GL/2005/06/f2f-agenda.html>.

Section 508 regulations:
<www.access.gpo.gov/nara/cfr/waisidx_04/36cfr1194_04.html>.

Section 508 technical assistance:
<www.accessboard.gov/sec508/guide/1194.22.htm>.

The opinions expressed in this article are those of the author and do not
necessarily represent the views of the Federal Communications Commission or the
United States Government.



To subscribe to this list, send a message to:
slikom-request@xxxxxxxxxxxxx and enter subscribe in the message body.
To unsubscribe from this list, send a message to:
slikom-request@xxxxxxxxxxxxx and enter unsubscribe in the message body.
