Forwarding this post on behalf of Philippe Martin; mailing list problems should be going away soon, because we'll be setting up mailing lists on a DUF server using MailMan. I'll reply soon.
--Larry

-----Original Message-----
From: Philippe MARTIN [mailto:phmartin@xxxxxxxxxxxxx]
Sent: Thursday, May 11, 2006 10:54 PM
Subject: still not fine-grained and structured enough to be scalable

Dear all,

The Textop project has a classic list of goals and potential benefits (this is not a criticism; like many researchers I also attribute similarly expressed potential benefits to my own project, although like them I avoid using adjectives such as "grandiose", "radical", "revolutionary", "amazingly beneficial" and "new"). It also follows a classic approach: a cooperatively edited large informal hierarchy of "topics" with a cooperatively edited list of document elements (or metadata for these elements) to be associated with each topic, with conflicts resolved via discussions and in the end by a committee of motivated people or recognised experts in the domain. For example, I see some similarities between this project and Synview (1985: http://portal.acm.org/ft_gateway.cfm?id=637116&type=pdf), ScholOnto (1999: http://citeseer.ist.psu.edu/shum99representing.html), MathWorld (1999-2006: http://mathworld.wolfram.com/) and the Open Directory Project (1998-2006: http://dmoz.org/), although the ODP is more coarse-grained (it is much more about whole documents than document elements). This project is far more coarse-grained and far less ambitious than the HALO/Aristotle projects (see http://www.projecthalo.com/ and http://www.edge.org/3rd_culture/hillis04/hillis04_index.html), the QED project (http://www-unix.mcs.anl.gov/qed/) and the OpenGALEN project (http://www.opengalen.org/). Nevertheless, is this project achievable and worth achieving exactly as it is currently described, that is, with a classic, rather coarse-grained and loosely structured approach? From my viewpoint, it is not.
The first problem is the informal hierarchy of topics (which may contain "many thousands if not millions of outline headers"). It is well recognised that informal hierarchies of topics are largely arbitrary (there is no "right place" for a node; placing nodes is a matter of personal preferences/goals). Hence it is difficult to retrieve information or to know where to insert information, and this leads to many redundancies and inconsistencies (in the same way that Web documents are often redundant and inconsistent, and their content difficult to retrieve and compare). This is because there are no formal/precise/semantic/meaningful relations (such as category subsumption, statement specialization, mereological relations, ...) between the nodes of the hierarchy. When indexing "interesting documents", superficial and informal hierarchies of topics such as those of Yahoo or the ODP may make sense (since documents are about many ideas), but when categorising individual ideas, concepts or objects, informal hierarchies cannot work. For the Textop project, the minimal support that should be used is (i) an updatable lexical ontology (and hence semantic network) of English, such as the one browsable and updatable at http://www.webkb.org (although it is derived from WordNet and many improvements still need to be made before it can be a "good" support), and (ii) an updatable conceptual/semantic network of individual statements connected by conceptual intersentential relations (specialization relations, argumentation relations, rhetorical relations, ...). Even in the hypertext community (for which the linked document elements are not necessarily fine-grained), the need for typed hyperlinks was finally recognised in the early 90s, as it is again nowadays by the creators of "semantic wikis".
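As a rough illustration of point (ii), here is a minimal Python sketch of a network of individual statements connected by typed relations. It is not the WebKB data model or syntax: the class names, the set of relation names and the example statements are all assumptions made for illustration only.

```python
from dataclasses import dataclass, field

# Hypothetical relation types, loosely named after the kinds of relations
# mentioned above (specialization, argumentation, mereological).
RELATION_TYPES = {"specialization", "argument", "objection", "part_of"}


@dataclass
class Statement:
    ident: str    # hypothetical identifier
    text: str     # one sentence, not a paragraph
    creator: str  # each node records who entered it


@dataclass
class Network:
    statements: dict = field(default_factory=dict)
    links: list = field(default_factory=list)  # (source, relation, target)

    def add(self, statement):
        self.statements[statement.ident] = statement

    def relate(self, source, relation, target):
        # Every link must carry an explicit, known type.
        if relation not in RELATION_TYPES:
            raise ValueError(f"unknown relation type: {relation}")
        if source not in self.statements or target not in self.statements:
            raise KeyError("both ends of a relation must be existing statements")
        self.links.append((source, relation, target))

    def related(self, target, relation):
        """Return the statements linked to `target` by `relation`."""
        return [self.statements[s] for s, r, t in self.links
                if t == target and r == relation]


net = Network()
net.add(Statement("s1", "Informal topic hierarchies are arbitrary.", "pm"))
net.add(Statement("s2", "There is no single right place for a node.", "pm"))
net.add(Statement("s3", "Topic directories work for whole documents.", "other"))
net.relate("s2", "argument", "s1")
net.relate("s3", "objection", "s1")

for s in net.related("s1", "argument"):
    print(s.text)
```

Because each link is typed, the arguments for a statement can be retrieved separately from the objections to it, which an untyped outline of topics cannot do.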
The second (although related) problem of this project is that a "paragraph" is not a fine-grained enough unit of information to support a scalable indexation/retrieval/comparison of information and a "democratic" cooperation between the information providers. Indeed, for each idea/topic/statement there will be thousands of paragraphs (from different documents) about (or giving an argument for) that particular idea/topic/statement, and simply listing all of these paragraphs will not make it possible to compare/organise the various underlying ideas/topics/statements/arguments/objections. To do so, the above-cited updatable conceptual/semantic network of individual statements is required (the unit of information should be a sentence, not a set of sentences). And with such a network, where each node has a recorded creator, it is possible to calculate a value for the "originality" and "usefulness" of each statement (and hence also for each creator of statements) based on votes and the argumentation tree associated with each statement. Thus, there is no need for a committee to decide which statements are "correct" and "interesting" and to remove the other ones; instead, each user can filter out (or change the presentation of) statements with low originality/usefulness (a base algorithm is given in the sections 2.1 and 2.2 of the articles accessible from http://www.webkb.org/doc/papers/iccs05/ but, ideally, options should be provided to each user for the calculated values to better reflect what that user believes is original or useful). (Note: Section 2.1 of this article also shows small examples of the above-cited semantic network of statements, but it is better to see the more complete examples accessible from http://www.webkb.org/kb/classif/sd.html#examples).
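To illustrate the general idea of scoring and filtering (not the base algorithm of the cited article, which is not reproduced here), the following Python sketch computes a "usefulness" value from votes and an argumentation tree and lets each user choose a filtering threshold. The scoring formula, the class and function names and the example data are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    text: str
    creator: str
    votes_for: int = 0
    votes_against: int = 0
    arguments: list = field(default_factory=list)   # supporting Nodes
    objections: list = field(default_factory=list)  # opposing Nodes


def usefulness(node):
    """Hypothetical score: net votes, plus the positive scores of the
    arguments, minus the positive scores of the objections."""
    score = node.votes_for - node.votes_against
    score += sum(max(0, usefulness(a)) for a in node.arguments)
    score -= sum(max(0, usefulness(o)) for o in node.objections)
    return score


def visible(nodes, threshold):
    """Statements a user sees, given that user's own threshold."""
    return [n for n in nodes if usefulness(n) >= threshold]


def creator_score(nodes, creator):
    """Value for a creator: the sum of the scores of their statements."""
    return sum(usefulness(n) for n in nodes if n.creator == creator)


support = Node("A node has no single right place.", "bob", votes_for=2)
weak = Node("Hierarchies are always fine.", "carl", votes_for=1, votes_against=3)
claim = Node("Informal hierarchies are arbitrary.", "ann",
             votes_for=3, votes_against=1,
             arguments=[support], objections=[weak])
spam = Node("Buy my book.", "carl", votes_against=2)

print(usefulness(claim))                         # 4
print([n.text for n in visible([claim, spam], 0)])
```

In this sketch a poorly rated objection (score below zero) does not count against the statement it targets, and nothing is deleted: a statement with a low score is simply hidden from users whose threshold excludes it.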
To conclude, I believe that a (logic-based) semantic network of categories and statements is needed for this project to be scalable and of more interest, whether the statements are formal, informal or semi-formal (i.e., semi-structured or using some formal terms). Thus, I believe the required interface is not the one currently envisaged but one that makes it possible to create a semantic network, with some of the nodes being pointers to parts of documents. Then, however, there would be few differences between that project and mine, and the tools and syntaxes I am developing could be re-used. It is clear that the path that I am advocating (that is, precision-oriented knowledge entering) is more demanding for information providers (they have to be analytic, precise and careful when writing statements), at least until a certain amount of information has been represented. But I do not see any way around that.

Philippe
_____________________________________________________________________
Dr. Philippe Martin
Web site: www.phmartin.info
Address: Griffith Uni, School of ICT, PMB 50 GCMC, QLD 9726 Australia
_____________________________________________________________________

====
textop - a Textop (http://www.textop.org) mailing list.
To post, send a mail to textop@xxxxxxxxxxxxx, or just reply to this post.
To unsubscribe, send a mail to textop-request@xxxxxxxxxxxxx with 'unsubscribe' in the header.