[textop] Re: FW: still not fine-grained and structured enough to be scalable
- From: "Sen, Kunal" <ksen@xxxxxx>
- To: <textop@xxxxxxxxxxxxx>
- Date: Fri, 12 May 2006 15:52:58 -0500
Dr. Martin has raised some fundamental issues. If I understood correctly, he
believes that in order to be useful and sustainable the system must be created
with a formal logical structure, grounded in theoretical work done in the area
of semantic knowledge representations, instead of the current intuitive
structure. His proposal is a radical departure from the current design, and it
would be unwise to comment on it without going over all the references more
carefully.
However, I'd like to comment on one of his early observations -- "informal
hierarchies of topics are very much arbitrary (there is no "right place" for a
node, placing nodes is a matter of personal preferences/goals)". I have been
involved with the creation of the taxonomic structure for a large general
reference encyclopedia. Here a small group of people were responsible for
creating the structure, and they all knew each other, had definite subject
areas that each individual was responsible for, and there was centralized
editorial control, and yet the location of a particular node often reflected
their personal biases. In some cases it came down to matters of taste.
Therefore, when the same is done by a large unconnected group of contributors,
at a much grander scale, subjective biases are bound to enter the structure. So
the questions are: how can they be minimized, is such a structure viable, and
how would such biases affect the effective use of such a system? I think this is
the time to take a hard look at these questions and make sure we are on the
right path. A collaborative project like this is hard to steer once it is
launched.
= Kunal
-------------------------------------------------------------
Kunal Sen, Ph.D.
Executive Director, International Digital Product Development
Encyclopædia Britannica, Inc.
331 N. LaSalle Street, Chicago, IL 60610
phone: 312 347 7320 fax: 312 347 7966 mobile: 312 961 6217
e-mail: ksen@xxxxxx
-----Original Message-----
From: textop-bounce@xxxxxxxxxxxxx [mailto:textop-bounce@xxxxxxxxxxxxx] On
Behalf Of Larry Sanger
Sent: Friday, May 12, 2006 1:09 AM
To: textop@xxxxxxxxxxxxx
Subject: [textop] FW: still not fine-grained and structured enough to be
scalable
Forwarding this post on behalf of Philippe Martin; mailing list problems should
be going away soon, because we'll be setting up mailing lists on a DUF server
using MailMan soon. I'll reply soon. --Larry
-----Original Message-----
From: Philippe MARTIN [mailto:phmartin@xxxxxxxxxxxxx]
Sent: Thursday, May 11, 2006 10:54 PM
Subject: still not fine-grained and structured enough to be scalable
Dear all,
The Textop project has a classic list of goals and potential benefits (this is
not a criticism; like many researchers I also attribute similarly expressed
potential benefits to my own project although like them I avoid using
adjectives such as "grandiose", "radical", "revolutionary", "amazingly
beneficial" and "new"). It also follows a classic approach: a cooperatively
edited large informal hierarchy of "topics" with a cooperatively edited list of
document elements (or metadata for these elements) to be associated with each
topic, with conflicts resolved via discussions and in the end by a committee of
motivated people or recognised experts in the domain.
For example, I see some similarities between this project and Synview (1985:
http://portal.acm.org/ft_gateway.cfm?id=637116&type=pdf),
ScholOnto (1999: http://citeseer.ist.psu.edu/shum99representing.html),
MathWorld (1999-2006 http://mathworld.wolfram.com/) and the Open Directory
Project (1998-2006: http://dmoz.org/) although the ODP is more coarse-grained
(it is much more about whole documents than document elements).
This project is far more coarse-grained and far less ambitious than the
HALO/Aristotle projects (see http://www.projecthalo.com/ and
http://www.edge.org/3rd_culture/hillis04/hillis04_index.html), the QED project
(http://www-unix.mcs.anl.gov/qed/) and the OpenGALEN project
(http://www.opengalen.org/).
Nevertheless, is this project achievable and worth achieving exactly as it
is currently described, that is, with a classic, rather coarse-grained and
loosely structured approach?
From my viewpoint, it is not.
The first problem is the informal hierarchy of topics (which may contain "many
thousands if not millions of outline headers"). It is well recognised that
informal hierarchies of topics are very much arbitrary (there is no "right
place" for a node, placing nodes is a matter of personal preferences/goals),
hence it is difficult to retrieve information or know where to insert
information and this leads to many redundancies and inconsistencies (in the
same way that Web documents are often redundant, inconsistent and their content
difficult to retrieve and compare). This is because there are no
formal/precise/semantic/meaningful relations (such as category subsumption,
statement specialization, mereological relations, ...) between the nodes of the
hierarchy.
When indexing "interesting documents", superficial and informal hierarchies of
topics such as those of Yahoo or the ODP may make sense (since documents are
about many ideas) but when categorising individual ideas, concepts or objects,
using informal hierarchies cannot work.
For the Textop project, the minimal support that should be used is
(i) an updatable lexical ontology (and hence semantic network) of
English such as for example the one browsable and updatable at
http://www.webkb.org (although it is derived from WordNet and
many improvements still need to be made before it can be a
"good" support),
(ii) an updatable conceptual/semantic network of individual
statements connected by conceptual intersentential relations
(specialization relations, argumentation relations, rhetoric
relations, ...).
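
As a rough illustration of point (ii), here is a minimal Python sketch of a
network of single-sentence statements joined by typed relations. It is not the
WebKB data model: the class name, the method names and the three relation types
are assumptions chosen for the example, and the sample statements and creators
are made up.

```python
from collections import defaultdict

# Hypothetical relation types, loosely following the kinds named above.
RELATION_TYPES = {"specialization", "argument", "objection"}


class StatementNetwork:
    """Statements (one sentence each) linked by typed, directed relations."""

    def __init__(self):
        self.statements = {}            # id -> (text, creator)
        self.links = defaultdict(list)  # source id -> [(relation, target id)]

    def add_statement(self, sid, text, creator):
        self.statements[sid] = (text, creator)

    def relate(self, source, relation, target):
        # Reject untyped or unknown links: every edge must carry a meaning.
        if relation not in RELATION_TYPES:
            raise ValueError(f"unknown relation type: {relation}")
        if source not in self.statements or target not in self.statements:
            raise KeyError("both statements must exist before linking")
        self.links[source].append((relation, target))

    def related(self, target, relation):
        """Return ids of statements linked to `target` by `relation`."""
        return [s for s, edges in self.links.items()
                if (relation, target) in edges]


net = StatementNetwork()
net.add_statement("s1", "Informal topic hierarchies are arbitrary.", "alice")
net.add_statement("s2", "Editors often place the same node differently.", "bob")
net.add_statement("s3", "Editorial guidelines can reduce arbitrariness.", "carol")
net.relate("s2", "argument", "s1")
net.relate("s3", "objection", "s1")

print(net.related("s1", "argument"))   # statements arguing for s1
print(net.related("s1", "objection"))  # statements objecting to s1
```

Because every link is typed, a query such as "all objections to s1" is a lookup
rather than a search through paragraphs filed under a topic heading.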
Even in the hypertext community (for which the linked document elements are not
necessarily fine-grained), the need for typed hyperlinks was finally recognised
in the early 90's, as it is again nowadays by the creators of "semantic wikis".
The second (although related) problem of this project is that a "paragraph"
is not a fine-grained enough unit of information to support a scalable
indexation/retrieval/comparison of information and a "democratic"
cooperation between the information providers. Indeed, for each
idea/topic/statement there will be thousands of paragraphs (from different
documents) about (or giving an argument for) that particular
idea/topic/statement and simply listing all of these paragraphs will not make it
possible to compare/organise the various underlying
ideas/topics/statements/arguments/objections.
To do so, the above cited updatable conceptual/semantic network of individual
statements is required (the unit of information should be a sentence, not a set
of sentences). And with such a network where each node has a recorded creator,
it is possible to calculate a value for the "originality" and "usefulness" of
each statement (and hence also for each creator of statements) based on votes
and the argumentation tree associated to each statement; thus, there is no need
for a committee to decide which statements are "correct" and "interesting" and
remove the other ones; instead, each user can filter out (or change the
presentation of) statements with low originality/usefulness (a base algorithm
is given in Sections 2.1 and 2.2 of the articles accessible from
http://www.webkb.org/doc/papers/iccs05/ but, ideally, options should be
provided to each user for the calculated values to better reflect what that
user believes is original or useful).
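
To make the idea of per-user filtering concrete, here is a small Python sketch.
It is NOT the algorithm from the papers cited above: the scoring rule (net
votes, plus the positive scores of supporting arguments, minus the positive
scores of objections) is an assumption invented for illustration, as are all
names and sample data. It also assumes the argumentation structure is a tree,
with no cycles.

```python
from dataclasses import dataclass, field


@dataclass
class Statement:
    text: str
    creator: str
    votes_for: int = 0
    votes_against: int = 0
    arguments: list = field(default_factory=list)   # supporting statements
    objections: list = field(default_factory=list)  # opposing statements


def usefulness(s):
    """Illustrative score only; not the rule from the cited papers."""
    score = s.votes_for - s.votes_against
    # An argument or objection counts only if it is itself positively scored.
    score += sum(max(0, usefulness(a)) for a in s.arguments)
    score -= sum(max(0, usefulness(o)) for o in s.objections)
    return score


def visible(statements, threshold):
    """Each user picks a threshold; nothing is deleted from the network."""
    return [s for s in statements if usefulness(s) >= threshold]


arg = Statement("Supporting point.", "bob", votes_for=3, votes_against=1)
obj = Statement("Weak objection.", "carol", votes_for=1, votes_against=4)
root = Statement("Main claim.", "alice", votes_for=2,
                 arguments=[arg], objections=[obj])

print(usefulness(root))  # 2 net votes + 2 from arg - 0 from obj = 4
print([s.creator for s in visible([root, arg, obj], threshold=1)])
```

A creator's score could then be taken as the sum of the scores of that
creator's statements, again only as one possible convention.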
(Note: Section 2.1 of this article also shows small examples of the above cited
semantic network of statements but it is better to see the more complete
examples accessible from http://www.webkb.org/kb/classif/sd.html#examples).
To conclude, I believe that a (logic-based) semantic network of categories and
statements is needed for this project to be scalable and of more interest,
whether or not the statements are formal, informal or semi-formal (i.e.,
semi-structured or using some formal terms). Thus, I believe the required
interface is not the one currently envisaged but one that permits creating a
semantic network, with some of the nodes being pointers to parts of documents.
Then, however, there would be few differences between that project and mine,
and the tools and syntaxes I am developing could be re-used. It is clear that
the path that I am advocating (that is, precision-oriented knowledge entering)
is more demanding for information providers (they have to be analytic, precise
and careful when writing statements), at least until a certain amount of
information has been represented. But I do not see any escape from that.
Philippe
_____________________________________________________________________
Dr. Philippe Martin
Web site: www.phmartin.info
Address: Griffith Uni, School of ICT, PMB 50 GCMC, QLD 9726 Australia
_____________________________________________________________________
==
textop - a Textop (http://www.textop.org) mailing list.
To post, send a mail to textop@xxxxxxxxxxxxx, or just reply to this post.
To unsubscribe, send a mail to textop-request@xxxxxxxxxxxxx with 'unsubscribe'
in the header.