[textop] Re: FW: still not fine-grained and structured enough to be scalable

  • From: "Sen, Kunal" <ksen@xxxxxx>
  • To: <textop@xxxxxxxxxxxxx>
  • Date: Fri, 12 May 2006 15:52:58 -0500

Dr. Martin has raised some fundamental issues. If I understood correctly, he 
believes that in order to be useful and sustainable the system must be created 
with a formal logical structure, grounded in theoretical work done in the area 
of semantic knowledge representations, instead of the current intuitive 
structure. His proposal is a radical departure from the current design, and it 
would be unwise to comment on it without going over all the references more 

However, I'd like to comment on one of his early observations -- "informal 
hierarchies of topics are very much arbitrary (there is no "right place" for a 
node, placing nodes is a matter of personal preferences/goals)". I have been 
involved with the creation of the taxonomic structure for a large general 
reference encyclopedia. There, a small group of people was responsible for 
creating the structure. They all knew each other, each had definite subject 
areas to look after, and there was centralized editorial control. Even so, the 
location of a particular node often reflected their personal biases. In some 
cases it came down to matters of taste.

Therefore, when the same is done by a large unconnected group of contributors, 
at a much grander scale, subjective biases are bound to enter the structure. So 
the questions are: how can they be minimized, is such a structure viable, and 
how would such biases affect the effective use of such a system? I think this is 
the time to take a hard look at these questions and make sure we are on the 
right path. A collaborative project like this is hard to steer once it is 

= Kunal

Kunal Sen, Ph.D.
Executive Director, International Digital Product Development
Encyclopædia Britannica, Inc.
331 N. LaSalle Street, Chicago, IL 60610
phone: 312 347 7320   fax: 312 347 7966   mobile: 312 961 6217
e-mail: ksen@xxxxxx

-----Original Message-----
From: textop-bounce@xxxxxxxxxxxxx [mailto:textop-bounce@xxxxxxxxxxxxx] On 
Behalf Of Larry Sanger
Sent: Friday, May 12, 2006 1:09 AM
To: textop@xxxxxxxxxxxxx
Subject: [textop] FW: still not fine-grained and structured enough to be 
scalable

Forwarding this post on behalf of Philippe Martin; mailing list problems should 
go away soon, because we'll be setting up mailing lists on a DUF server using 
MailMan.  I'll reply shortly. --Larry

-----Original Message-----
From: Philippe MARTIN [mailto:phmartin@xxxxxxxxxxxxx]
Sent: Thursday, May 11, 2006 10:54 PM
Subject: still not fine-grained and structured enough to be scalable

Dear all,

The Textop project has a classic list of goals and potential benefits (this is 
not a criticism; like many researchers I also attribute similarly expressed 
potential benefits to my own project although, like them, I avoid using 
adjectives such as "grandiose", "radical", "revolutionary", "amazingly 
beneficial" and "new"). It also follows a classic approach: a cooperatively 
edited large informal hierarchy of "topics" with a cooperatively edited list of 
document elements (or metadata for these elements) to be associated with each 
topic, with conflicts resolved via discussions and in the end by a committee of 
motivated people or recognised experts in the domain.
For example, I see some similarities between this project and Synview (1985: 
ScholOnto (1999: http://citeseer.ist.psu.edu/shum99representing.html),
MathWorld (1999-2006 http://mathworld.wolfram.com/) and the Open Directory 
Project (1998-2006: http://dmoz.org/) although the ODP is more coarse-grained 
(it is much more about whole documents than document elements).
This project is far more coarse-grained and far less ambitious than the 
HALO/Aristotle projects (see http://www.projecthalo.com/ and 
http://www.edge.org/3rd_culture/hillis04/hillis04_index.html), the QED project 
(http://www-unix.mcs.anl.gov/qed/) and the OpenGALEN project 

Nevertheless, is this project achievable and worth achieving exactly as it is 
currently described, that is, with a classic, rather coarse-grained and loosely 
structured approach? From my viewpoint, it is not.

The first problem is the informal hierarchy of topics (which may contain "many 
thousands if not millions of outline headers"). It is well recognised that 
informal hierarchies of topics are very much arbitrary (there is no "right 
place" for a node, placing nodes is a matter of personal preferences/goals). 
Hence it is difficult to retrieve information or to know where to insert 
information, and this leads to many redundancies and inconsistencies (in the 
same way that Web documents are often redundant and inconsistent, and their 
content difficult to retrieve and compare). This is because there are no 
formal/precise/semantic/meaningful relations (such as category subsumption, 
statement specialization, mereological relations, ...) between the nodes of the 
When indexing "interesting documents", superficial and informal hierarchies of 
topics such as those of Yahoo or the ODP may make sense (since documents are 
about many ideas) but when categorising individual ideas, concepts or objects, 
using informal hierarchies cannot work. 
For the Textop project, the minimal support that should be used is
(i) an updatable lexical ontology (and hence semantic network) of 
    English, for example the one browsable and updatable at
    http://www.webkb.org (although it is derived from WordNet and
    many improvements still need to be made before it can be a 
    "good" support),
(ii) an updatable conceptual/semantic network of individual 
    statements connected by conceptual intersentential relations
    (specialization relations, argumentation relations, rhetorical
    relations, ...).
Even in the hypertext community (for which the linked document elements are not 
necessarily fine-grained), the need for typed hyperlinks was finally recognised 
in the early 90's, as it is again nowadays by the creators of "semantic wikis".
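
To make the contrast with an untyped outline concrete, here is a minimal 
sketch in Python of a network whose links carry a relation type. It is only an 
illustration: the class name, the relation names ("subtype_of", "part_of") and 
the example nodes are hypothetical and are not taken from WebKB or from any 
Textop design. The point is that once links are typed, a question such as 
"what are all the supertypes of this category?" has a definite answer, which 
an informal hierarchy cannot give.

```python
from collections import defaultdict


class SemanticNetwork:
    """Toy network of nodes joined by typed links (all names are assumptions)."""

    def __init__(self):
        # Maps (source node, relation type) to the set of target nodes.
        self._links = defaultdict(set)

    def add_link(self, source, relation, target):
        self._links[(source, relation)].add(target)

    def targets(self, source, relation):
        """Direct targets of `source` via one relation type."""
        return set(self._links.get((source, relation), ()))

    def ancestors(self, node, relation="subtype_of"):
        """All nodes reachable from `node` by following `relation` transitively."""
        seen = set()
        stack = [node]
        while stack:
            current = stack.pop()
            for parent in self._links.get((current, relation), ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen


net = SemanticNetwork()
net.add_link("sparrow", "subtype_of", "bird")    # category subsumption
net.add_link("bird", "subtype_of", "animal")
net.add_link("wing", "part_of", "bird")          # mereological relation

print(net.ancestors("sparrow"))        # supertypes only; part_of is not followed
print(net.targets("wing", "part_of"))
```

Because "part_of" and "subtype_of" are kept apart, following subsumption from 
"sparrow" never wanders into parts of a bird, which is the kind of confusion an 
untyped outline invites.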

The second (although related) problem of this project is that a "paragraph"
is not a fine-grained enough unit of information to support scalable 
indexation/retrieval/comparison of information and "democratic"
cooperation between the information providers. Indeed, for each 
idea/topic/statement there will be thousands of paragraphs (from different
documents) about (or giving an argument for) that particular 
idea/topic/statement, and simply listing all of these paragraphs will not make 
it possible to compare/organise the various underlying 
To do so, the above-cited updatable conceptual/semantic network of individual 
statements is required (the unit of information should be a sentence, not a set 
of sentences). With such a network, where each node has a recorded creator, 
it is possible to calculate a value for the "originality" and "usefulness" of 
each statement (and hence also for each creator of statements) based on votes 
and on the argumentation tree associated with each statement. Thus, there is no 
need for a committee to decide which statements are "correct" and "interesting" 
and to remove the other ones; instead, each user can filter out (or change the 
presentation of) statements with low originality/usefulness (a base algorithm 
is given in Sections 2.1 and 2.2 of the article accessible from 
http://www.webkb.org/doc/papers/iccs05/ but, ideally, options should be 
provided to each user for the calculated values to better reflect what that 
user believes is original or useful).
(Note: Section 2.1 of this article also shows small examples of the above-cited 
semantic network of statements but it is better to see the more complete 
examples accessible from http://www.webkb.org/kb/classif/sd.html#examples).
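
As a rough illustration of per-user filtering without a committee, the 
following Python sketch scores each statement from its votes and from the 
statements arguing for or against it. This is NOT the algorithm of the article 
cited above: the scoring formula, the `weight` parameter, the field names and 
the sample statements are all assumptions made up for this example, and it 
assumes the argumentation structure is a tree (no cycles).

```python
from dataclasses import dataclass, field


@dataclass
class Statement:
    text: str
    creator: str                      # every node records who created it
    votes_for: int = 0
    votes_against: int = 0
    arguments: list = field(default_factory=list)   # statements supporting it
    objections: list = field(default_factory=list)  # statements arguing against


def usefulness(stmt, weight=0.5):
    """Hypothetical score: net votes, adjusted by the scores of child statements.

    Only children with a positive score count, so a badly rated objection
    does not lower the score of the statement it attacks.
    """
    score = float(stmt.votes_for - stmt.votes_against)
    for arg in stmt.arguments:
        score += weight * max(usefulness(arg, weight), 0.0)
    for obj in stmt.objections:
        score -= weight * max(usefulness(obj, weight), 0.0)
    return score


def visible(statements, threshold):
    """Each user picks a threshold; nothing is deleted from the network."""
    return [s for s in statements if usefulness(s) >= threshold]


support = Statement("Typed links make retrieval precise.", "alice", votes_for=4)
objection = Statement("Typed links are costly to enter.", "bob", votes_for=2)
claim = Statement("The network should use typed links.", "carol",
                  votes_for=3, votes_against=1,
                  arguments=[support], objections=[objection])
weak = Statement("Outlines are always enough.", "dave", votes_against=1)

print(usefulness(claim))                         # 2 + 0.5*4 - 0.5*2 = 3.0
print([s.text for s in visible([claim, weak], threshold=0)])
```

A score per creator could be derived in the same spirit, for instance by 
summing the scores of that creator's statements, and a user who weighs 
objections more heavily could simply pass a different `weight`.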

To conclude, I believe that a (logic-based) semantic network of categories and 
statements is needed for this project to be scalable and of more interest, 
whether or not the statements are formal, informal or semi-formal (i.e., 
semi-structured or using some formal terms). Thus, I believe the required 
interface is not the one currently envisaged but one that allows users to 
create a semantic network, with some of the nodes being pointers to parts of 
documents. Then, however, there would be few differences between that project 
and mine, and the tools and syntaxes I am developing could be re-used. It is 
clear that the path that I am advocating (that is, precision-oriented knowledge 
entering) is more demanding for information providers (they have to be 
analytic, precise and careful when writing statements), at least until a 
certain amount of information has been represented. But I do not see any way 
around that.

Dr. Philippe Martin
Web site: www.phmartin.info
Address: Griffith Uni, School of ICT, PMB 50 GCMC, QLD 9726 Australia 

textop - a Textop (http://www.textop.org) mailing list.
To post, send a mail to textop@xxxxxxxxxxxxx, or just reply to this post.
To unsubscribe, send a mail to textop-request@xxxxxxxxxxxxx with 'unsubscribe' 
in the header.
