[textop] FW: still not fine-grained and structured enough to be scalable

  • From: "Larry Sanger" <larry.sanger@xxxxxxxxxxxxxxxx>
  • To: <textop@xxxxxxxxxxxxx>
  • Date: Thu, 11 May 2006 23:08:37 -0700

Forwarding this post on behalf of Philippe Martin. The mailing list problems
should go away soon, because we'll be setting up mailing lists on a
DUF server using Mailman.  I'll reply soon. --Larry

-----Original Message-----
From: Philippe MARTIN [mailto:phmartin@xxxxxxxxxxxxx] 
Sent: Thursday, May 11, 2006 10:54 PM
Subject: still not fine-grained and structured enough to be scalable

Dear all,

The Textop project has a classic list of goals and potential benefits (this
is not a criticism; like many researchers, I attribute similarly expressed
potential benefits to my own project, although like them I avoid adjectives
such as "grandiose", "radical", "revolutionary", "amazingly beneficial" and
"new"). It also follows a classic approach: a cooperatively edited, large,
informal hierarchy of "topics", with a cooperatively edited list of document
elements (or metadata for these elements) associated with each topic, and
with conflicts resolved via discussion and, in the end, by a committee of
motivated people or recognised experts in the domain.
For example, I see some similarities between this project and
Synview (1985: http://portal.acm.org/ft_gateway.cfm?id=637116&type=pdf),
ScholOnto (1999: http://citeseer.ist.psu.edu/shum99representing.html),
MathWorld (1999-2006: http://mathworld.wolfram.com/) and the Open Directory
Project (1998-2006: http://dmoz.org/), although the ODP is more
coarse-grained (it is much more about whole documents than about document
elements).
This project is far more coarse-grained and far less ambitious than the
HALO/Aristotle projects (see http://www.projecthalo.com/ and
http://www.edge.org/3rd_culture/hillis04/hillis04_index.html), the QED
project (http://www-unix.mcs.anl.gov/qed/) and the 
OpenGALEN project (http://www.opengalen.org/).

Nevertheless, is this project achievable, and worth achieving, exactly as
it is currently described, that is, with a classic, rather coarse-grained
and loosely structured approach?
From my viewpoint, it is not.

The first problem is the informal hierarchy of topics (which may contain
"many thousands if not millions of outline headers"). It is well recognised
that informal hierarchies of topics are very much arbitrary (there is no
"right place" for a node; placing nodes is a matter of personal
preferences/goals). Hence it is difficult to retrieve information or to know
where to insert information, and this leads to many redundancies and
inconsistencies (in the same way that Web documents are often redundant and
inconsistent, and their content difficult to retrieve and compare). This is
because there are no formal/precise/semantic/meaningful relations (such as
category subsumption, statement specialization, mereological relations, ...)
between the nodes of the hierarchy.
When indexing "interesting documents", superficial and informal hierarchies
of topics such as those of Yahoo or the ODP may make sense (since documents
are about many ideas), but when categorising individual ideas, concepts or
objects, informal hierarchies cannot work.
For the Textop project, the minimal support that should be used is
(i) an updatable lexical ontology (and hence semantic network) of
    English, such as the one browsable and updatable at
    http://www.webkb.org (although it is derived from WordNet and
    many improvements still need to be made before it can be a
    "good" support),
(ii) an updatable conceptual/semantic network of individual
    statements connected by conceptual intersentential relations
    (specialization relations, argumentation relations, rhetorical
    relations, ...).
Even in the hypertext community (for which the linked document elements are
not necessarily fine-grained), the need for typed hyperlinks was finally
recognised in the early 1990s, as it is again nowadays by the creators of
"semantic wikis".
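
To make point (ii) concrete, here is a minimal sketch of a network of
single-sentence statements connected by typed links. It is an illustration
only, not the WebKB data model: the class name StatementNetwork, the four
relation type names, the creator names and the sample sentences are all
assumptions made up for this example.

```python
from dataclasses import dataclass, field

# Hypothetical relation types, named after the kinds of relations listed above.
RELATION_TYPES = {"specialization", "argument", "objection", "part_of"}


@dataclass
class StatementNetwork:
    """Statements (one sentence each) connected by typed links."""
    statements: dict = field(default_factory=dict)  # id -> (sentence, creator)
    links: list = field(default_factory=list)       # (source id, relation, target id)

    def add_statement(self, sid, sentence, creator):
        self.statements[sid] = (sentence, creator)

    def link(self, source, relation, target):
        # Reject unknown link types: every link must carry a declared meaning.
        if relation not in RELATION_TYPES:
            raise ValueError(f"unknown relation type: {relation}")
        if source not in self.statements or target not in self.statements:
            raise KeyError("both ends of a link must be existing statements")
        self.links.append((source, relation, target))

    def related(self, target, relation):
        """Return ids of statements linked to `target` by `relation`."""
        return [s for s, r, t in self.links if t == target and r == relation]


# Made-up sample data: one claim, one argument for it, one objection to it.
net = StatementNetwork()
net.add_statement("s1", "Informal topic hierarchies are arbitrary.", "user_a")
net.add_statement("s2", "There is no right place for a node.", "user_a")
net.add_statement("s3", "Editors can agree on placement by discussion.", "user_b")
net.link("s2", "argument", "s1")
net.link("s3", "objection", "s1")
```

Because each link has a type, a query such as net.related("s1", "objection")
can retrieve only the objections to a statement, which an untyped topic
hierarchy cannot do.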

The second (although related) problem of this project is that a "paragraph"
is not a fine-grained enough unit of information to support scalable
indexation/retrieval/comparison of information and "democratic"
cooperation between the information providers. Indeed, for each
idea/topic/statement there will be thousands of paragraphs (from different
documents) about (or giving an argument for) that particular
idea/topic/statement, and simply listing all of these paragraphs will not
make it possible to compare/organise the various underlying
ideas/topics/statements/arguments/objections.
To do so, the above-cited updatable conceptual/semantic network of
individual statements is required (the unit of information should be a
sentence, not a set of sentences). With such a network, where each node
has a recorded creator, it is possible to calculate a value for the
"originality" and "usefulness" of each statement (and hence also of each
creator of statements) based on votes and on the argumentation tree
associated with each statement. Thus, there is no need for a committee to
decide which statements are "correct" and "interesting" and to remove the
others; instead, each user can filter out (or change the presentation of)
statements with low originality/usefulness. A base algorithm is given in
Sections 2.1 and 2.2 of the articles accessible from
http://www.webkb.org/doc/papers/iccs05/ but, ideally, options should be
provided to each user so that the calculated values better reflect what
that user believes is original or useful.
(Note: Section 2.1 of this article also shows small examples of
the above-cited semantic network of statements, but it is better
to see the more complete examples accessible from
http://www.webkb.org/kb/classif/sd.html#examples).
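
The idea of deriving a per-statement value from votes and an argumentation
tree, and letting each user filter on it, can be sketched as follows. This
is NOT the base algorithm of the cited article; it is a deliberately simple
stand-in. The scoring rule (net votes, plus the positive scores of
supporting arguments, minus the positive scores of objections), the
Statement fields and the function names are all assumptions made up for
illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Statement:
    text: str
    creator: str
    votes_for: int = 0
    votes_against: int = 0
    arguments: list = field(default_factory=list)   # statements supporting this one
    objections: list = field(default_factory=list)  # statements attacking this one


def usefulness(s):
    """Hypothetical score: net votes, adjusted by the argumentation tree."""
    score = s.votes_for - s.votes_against
    # Only arguments/objections that themselves score above zero count.
    score += sum(max(0, usefulness(a)) for a in s.arguments)
    score -= sum(max(0, usefulness(o)) for o in s.objections)
    return score


def visible(statements, threshold):
    """Per-user filter: keep statements scoring at least `threshold`."""
    return [s for s in statements if usefulness(s) >= threshold]


# Made-up sample data.
counter = Statement("Counter-objection", "user_c", votes_for=4)
objection = Statement("Objection", "user_b", votes_for=5, objections=[counter])
argument = Statement("Supporting argument", "user_a", votes_for=2)
claim = Statement("Claim", "user_a", votes_for=3, votes_against=1,
                  arguments=[argument], objections=[objection])
```

Here usefulness(claim) evaluates to 3: net votes of 2, plus 2 for the
argument, minus 1 for the objection (itself weakened by its
counter-objection). Nothing is deleted; a user who sets a threshold of 4
simply does not see the claim, while another user with a lower threshold
does.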

To conclude, I believe that a (logic-based) semantic network of
categories and statements is needed for this project to be scalable and of
more interest, whether the statements are formal, informal or semi-formal
(i.e., semi-structured or using some formal terms). Thus, I believe the
required interface is not the one currently envisaged but one that permits
the creation of a semantic network, with some of the nodes being pointers
to parts of documents. Then, however, there would be few differences
between that project and mine, and the tools and syntaxes I am developing
could be re-used. It is clear that the path that I am advocating (that is,
precision-oriented knowledge entry) is more demanding for information
providers (they have to be analytic, precise and careful when writing
statements), at least until a certain amount of information has been
represented. But I do not see any way around that.

Philippe
_____________________________________________________________________
Dr. Philippe Martin 
Web site: www.phmartin.info
Address: Griffith Uni, School of ICT, PMB 50 GCMC, QLD 9726 Australia
_____________________________________________________________________



