[textop] Re: FW: still not fine-grained and structured enough to be scalable

  • From: "Larry Sanger" <blarneypilgrim@xxxxxxxxx>
  • To: <textop@xxxxxxxxxxxxx>
  • Date: Fri, 12 May 2006 21:39:16 -0700

Responding to Philippe Martin now.  I appreciate the extreme care and
thought that went into this post, Philippe, as well as your personal
expertise in areas extremely relevant to this project.  At this point, this
sort of sweeping re-envisioning of what I have in mind is appropriate and
important to consider.  Even if we do not opt to take your advice, I think
we will learn something important.

> From: Philippe MARTIN [mailto:phmartin@xxxxxxxxxxxxx]
> Sent: Thursday, May 11, 2006 10:54 PM
> Subject: still not fine-grained and structured enough to be scalable
...
> For example, I see some similarities between this project and
> Synview (1985: 
> http://portal.acm.org/ft_gateway.cfm?id=637116&type=pdf),
> ScholOnto (1999: http://citeseer.ist.psu.edu/shum99representing.html),
> MathWorld (1999-2006: http://mathworld.wolfram.com/) and the
> Open Directory Project (1998-2006: http://dmoz.org/), although
> the ODP is more coarse-grained (it is much more about whole documents
> than document elements).
> This project is far more coarse-grained and far less 
> ambitious than the HALO/Aristotle projects (see 
> http://www.projecthalo.com/ and 
> http://www.edge.org/3rd_culture/hillis04/hillis04_index.html),
>  the QED project (http://www-unix.mcs.anl.gov/qed/) and the 
> OpenGALEN project (http://www.opengalen.org/).

I can't claim to be familiar with all of these projects, but it seems most
of the projects listed here, especially the latter projects, are indeed much
finer-grained.  Also, they are as much technical projects (no doubt designed
to prove or demonstrate something of interest mainly to information
theorists) as they are reference projects; e.g., they are attempts to build
ontologies, or provide scalable technical models of knowledge, that don't
have *immediate* uses.  MathWorld and the ODP have at least developed a
great deal of usable content--which is what impresses me most, frankly.  I
want Textop to be like that: very useful.

> Nevertheless, is this project achievable and worth achieving
> exactly as it is currently described, that is, with
> a classic, rather coarse-grained and loosely structured approach?
> From my viewpoint, it is not.

I love a definite proposition, and that is one!

> The first problem is the informal hierarchy of topics (which
> may contain "many thousands if not millions of outline 
> headers"). It is well recognised that informal hierarchies of 
> topics are very much arbitrary (there is no "right place" for 
> a node, placing nodes is a matter of personal
> preferences/goals), hence it is difficult to retrieve 
> information or know where to insert information 
> and this leads to many redundancies and inconsistencies (in 
> the same way that Web documents are often redundant, inconsistent and 
> their content difficult to retrieve and compare). This is because 
> there are no formal/precise/semantic/meaningful relations 
> (such as category subsumption, statement specialization, mereological 
> relations, ...) between the nodes of the hierarchy. 

I suspect that this problem is made much more tractable when one is dealing
with text chunks that are individuated precisely by the fact that they make
(or are taken to make) definite, classifiable arguments, propositions,
definitions, etc.  That the items I propose to classify are these sorts of
"text chunks" is crucial to remember.

That's only part of the solution.  Another part is that it is then up to the
designers of the project to *designate* what the parent-child relations
shall mean.  I have followed a certain pattern with the Leviathan that I
have found useful, and which I might explain sometime; but it is clear
enough to me that the fact that there is a variety of choices of rules does
not imply that there's no distinguishing the *quality* of rules, or (more
generally) no way to settle upon a set of rational rules.

Perhaps the more difficult problem is one that Kunal Sen identified--how to
get people to agree on how to create an outline.

> When indexing "interesting documents", superficial and
> informal hierarchies of topics such as those of Yahoo or the
> ODP may make sense (since documents are about many ideas)
> but when categorising individual ideas, concepts or objects, 
> using informal hierarchies cannot work. 

Not to be merely contrary, but I actually think it is the reverse.  Please
do consult the work on Hobbes' Leviathan I've done so far
(http://www.textop.org/outline_help.html).  Whereas websites and books and
even encyclopedia articles concern very many different topics, and thus are
inherently problematic to classify, chunks of text are a different matter
altogether.  Something I have confirmed to my own satisfaction is that
chunking texts in the way I do makes it possible to organize the results
into an outline with much more satisfactory results than classifying
websites or books.

And bear in mind, the items that are being categorized here are decidedly
*not* "ideas, concepts or objects," but chunks of text.  That's an important
difference.  I very much suspect that you are thinking about the Collation
Project (that's what we're discussing) as an ontology, which *is* about
"ideas, concepts or objects."  But the outline of the Collation Project *is
not* an ontology, nor is it meant to be one.  Again, consult the example.

The fact that we're talking about outlining text chunks, not "ideas,
concepts or objects," makes a difference both in theory and in practice.  In
theory, we *should* expect relatively unclear concepts to require filing in
multiple places, and for an outline built out of concepts to be confusing
and redundant, for the reason that concepts do not enter into *unique*
semantic, logical, and other relations with each other.  But propositions,
definitions, arguments, explanations, etc.--human thought chunked at that
level--*do* fall into more definite relations.  Consider, for example,
"realism" as a concept.  This might fall under many other concepts in an
ontology.  But contrast that with a paragraph articulating what someone
means by "realism" in a particular case.  It is realism *about Platonic
universals*, for example.  Philosophers know where to put that, at least in
relation to a cluster of other related concepts.  Even more definitely can
they say what relation a specific point about realism about Platonic
universals bears to other points.

> For the Textop project, the minimal support that should be used is
> (i) an updatable lexical ontology (and hence semantic network) of 
>     English such as for example the one browsable and updatable at
>     http://www.webkb.org (although it is derived from WordNet and
>     many improvements still need to be made before it can be a 
>     "good" support),

Perhaps.  As much as I love ontologies generally, and the project of
building ontologies, and as much as I admire those who have the technical
chops to build coherent ontologies, I'm not sure what the benefit of a
formal or even a semi-formal ontology would be *in this context*.  I'm
looking at this as a practical project that will have a definite human use;
it's going to be a reference work.  So how would an updatable lexical
ontology be *of use* in this context?  And how can we expect people
constructing the outline just to buy into the ontology wholesale?

Besides, though I'm not absolutely sure of this, I wouldn't be at all
surprised if a usable ontology fell out of the careful examination of texts
in metaphysics, logic, and semantics.  By exploring the logical and other
relations of various definitions, arguments, etc., *in all their glorious
detail* (that's the important part), one is *least* apt to leave some
relevant consideration out.

But it is necessary in any case to follow the text where it leads.  It's
been one of my rules to create nodes only when necessary to place a text
correctly.  That's why you'll see some areas of my outline of the Leviathan
are very well-developed, and some of them are not.

> (ii) an updatable conceptual/semantic network of individual 
>     statements connected by conceptual intersentential relations
>     (specialization relations, argumentation relations,
>      rhetorical relations, ...).

Well, in any case, I'm trying to analyze the actual words of actual
texts--text chunks--and to "collate" them all together.  What (ii) proposes
is to create a network of *statements*.  So I guess what I need to know is how
this network of statements is to be generated: by summarizing texts, for
example?  Or by producing them ourselves, *a priori* as it were, and then
expecting texts to fall neatly into the network?

> Even in the hypertext community (for which the linked
> document elements are not necessarily fine-grained), the need 
> for typed hyperlinks 
> was finally recognised in the early 90's, as it is again 
> nowadays by the creators of "semantic wikis".

Well, the actual structure I propose is a hierarchical outline, so
the elements of the structure are nodes.  To create this, and to file chunks
into it as I've done for the Leviathan, it seems to me (at least) that
neither (i) nor (ii) is necessary.  And I'm not even sure why any particular
attempted ontology would be of that much help.  Nodes will, however, have to
have unique identifiers, distinct from the words that make up the header
that lives at a node.  Have a look at the proposed screenshot I have devised
for the contributor interface: http://www.textop.org/screenshot.html

There can also be cross-references, of course.
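
To make the mechanics concrete, here is a minimal sketch of such a node,
in Python.  The class and field names are purely illustrative assumptions,
not a spec:

    # Sketch: an outline node whose identity is a stable ID, not its
    # header text, so headers can be reworded without breaking anything.
    class OutlineNode:
        def __init__(self, node_id, header):
            self.node_id = node_id      # stable unique identifier
            self.header = header        # editable display text
            self.children = []          # ordered sub-nodes
            self.chunks = []            # text chunks filed at this node
            self.cross_refs = []        # node_ids of related nodes elsewhere

        def add_child(self, node):
            self.children.append(node)
            return node

    # Filings and cross-references point at node_id, so renaming a
    # header never breaks them.
    root = OutlineNode("n1", "Of the natural condition of mankind")
    child = root.add_child(OutlineNode("n1.2", "Causes of quarrel"))
    child.cross_refs.append("n7.4")     # hypothetical related node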

> The second (although related) problem of this project is that
> a "paragraph" is not a fine-grained enough unit of 
> information to support a scalable 
> indexation/retrieval/comparison of information and a 
> "democratic" cooperation between the information providers. 
> Indeed, for each idea/topic/statement there will be thousands 
> of paragraphs (from different
> documents) about (or giving an argument for) that particular 
> idea/topic/statement and simply 
> listing all of these paragraphs will not make it possible to
> compare/organise the various underlying
> ideas/topics/statements/arguments/objections. 

Of course there will be thousands of paragraphs, if the
"idea/topic/statement" is broad enough.  For example, if the topic were
"arguments for the existence of God," there would be probably tens of
thousands of paragraphs.  But philosophers, at least, actually have names
for different kinds of arguments (e.g., the argument from design vs. the
argument from first causes for the existence of God).  Part of the plan is
to lump arguments (and other linguistic entities) together, when there are
relatively few of them, under a specific node, and split them and find
distinctions when there are many (whether or not there are names to go with
the distinguished types).

Granted, it might turn out that there are *very* many instances of the
argument from design (just for example) in the literature (there must be
literally hundreds of instances of the basic argument being stated, to say
nothing of the discussion of surrounding matters), and that even if we try
to split these into types, we will find that there really aren't any
meaningfully different types that do not themselves each have dozens of
instances.  Well, in that case so be it, I say.  Humanity ought to stop all
this wasteful duplication of effort, already.  Note, users will be able
to filter the sources from which chunks are displayed.  So if there are a
zillion arguments from design, then just show me the ones from the 18th
century; or just the ones originally written in French; or just the
ones from some list of "the Great Books."  (And such filtering is exactly
how some academics' minds work: it doesn't really matter if it didn't happen
within the last 30 years or so.  It makes their work more tractable.  The
Collation Project, and Textop generally, is going to embarrass such people
terribly.  We'll demonstrate as never before that there's nothing new under
the sun.)
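
A rough Python sketch of the sort of filtering I mean; the metadata fields
(century, original_language, source) are hypothetical, not a fixed schema:

    # Sketch: per-user filtering of which chunks are displayed at a node.
    chunks = [
        {"source": "Paley, Natural Theology", "century": 19,
         "original_language": "English"},
        {"source": "Voltaire, Dictionnaire philosophique", "century": 18,
         "original_language": "French"},
        {"source": "Hume, Dialogues", "century": 18,
         "original_language": "English"},
    ]
    great_books = {"Hume, Dialogues"}   # some list of "the Great Books"

    # Just the 18th-century ones...
    eighteenth = [c for c in chunks if c["century"] == 18]
    # ...or just those originally written in French...
    in_french = [c for c in chunks if c["original_language"] == "French"]
    # ...or just those on the Great Books list.
    canonical = [c for c in chunks if c["source"] in great_books]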

Anyway, your observation above really goes to one of the more interesting
aspects of the Collation Project: if we build the outline *around* the
texts, i.e., if we use our summaries of text chunks to decide what outline
nodes shall exist (as I've done with Hobbes), then we are (together)
exploring the dialectical territory in *enormously fine* detail, something
that to my knowledge has never quite been done before, certainly not to the
extent I'm proposing.

> To do so, the above-cited updatable conceptual/semantic network of
> individual statements is required (the unit of information 
> should be a sentence, not a set of sentences).

Well, then consider the thing to be organized into the outline not the text
chunk but a summary of the text chunk.  (And, by the way, I have actually
put chunks into different parts of the outline under different summaries.
It seemed like the right thing to do at the time.)
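
In data terms, that just means a filing is a link between a chunk and a
node that carries its own summary, so one chunk can sit at several nodes
under different summaries.  A minimal sketch (all names hypothetical):

    # Sketch: each filing records the summary under which a chunk was
    # placed at a node; the same chunk may be filed at several nodes.
    filings = [
        {"chunk_id": "lev-13-08", "node_id": "n4.2",
         "summary": "Equality of ability breeds equality of hope."},
        {"chunk_id": "lev-13-08", "node_id": "n9.1",
         "summary": "Competition is a principal cause of quarrel."},
    ]

    def summaries_at(node_id):
        """Summaries shown at a given outline node."""
        return [f["summary"] for f in filings if f["node_id"] == node_id]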

> And with such 
> a network where each node has a recorded creator, it is 
> possible to calculate a 
> value for the "originality" and "usefulness" of each 
> statement (and hence also for each creator of statements) 
> based on votes and the 
> argumentation tree associated to each statement; thus, there 
> is no need for a committee to decide which statements are 
> "correct" and "interesting" and remove the other ones; 
> instead, each user 
> can filter out (or change the presentation of) statements 
> with low originality/usefulness (a base algorithm is given in
> sections 2.1 and 2.2 of the articles accessible from
> http://www.webkb.org/doc/papers/iccs05/ but, ideally, options
> should be provided to each user for the calculated values to
> better reflect what that user believes is original or useful).
> (Note: Section 2.1 of this article also shows small examples of
> the above-cited semantic network of statements but it is better
> to see the more complete examples accessible from 
> http://www.webkb.org/kb/classif/sd.html#examples).

As Kunal wisely said, to do this justice we'd have to read your references.
So, perhaps I'm not understanding at all, but this is not really sounding
very much like what I'm proposing to do.  Perhaps you can start at a more
basic level.  Are you saying that the project should *rate* particular
statements (outline nodes, I guess) as "correct" or "interesting"?  I would
see that as an unnecessary distraction--a side-project, perhaps.  The
project's aim is not to get at the truth, but to elicit the structure of
various points made in actual books.  Hopefully, the result will help
individuals decide what they think is true.

> To conclude, I believe that a (logic-based) semantic network of
> categories and statements is needed for this project to be 
> scalable

I don't see that you've proven this.  The feasibility of building an outline
of the sort I propose--which may turn out to have all sorts of
imperfections, but does the job--seems to be much more a practical question.

> and of more interest,

Forgive me, but I don't see how you've proven this either.  It would be
immodest of me to claim that the outline I've built of the Leviathan is of
interest even to philosophers, but I am very sure I could go on--strictly by
myself if I wanted to--and incorporate, say, Locke's Essay, Hume's Treatise,
some Reid, some Mill, some Russell, and I would have a very nice outline of
the history of English-language philosophy that would be of considerable
interest to historians of philosophy as a reference.  Especially if
specialists were to clean it up and help me with it.  Why *wouldn't* such a
thing be of considerable interest?

What I don't understand is how the interest of the result of this work would
*increase* if the outline were somehow based on, say, your ontology.  I
don't mean to claim it *wouldn't*, but I don't see how you've supported
this.

> whether or not the statements are formal, 
> informal or semi-formal (i.e., semi-structured or using some
> formal terms). Thus, I believe the required interface is not
> the one currently envisaged but one that permits the creation of a
> semantic network, with some of the nodes being pointers to
> parts of documents. Then, however, there would be few 
> differences between that project and mine, and the tools and 
> syntaxes I am developing could be re-used.

It would be very interesting indeed to have an elaboration of this.  I think
that's probably a useful way to move this discussion forward.  What would
the input interface be like?  How would *the collation of texts* proceed?
What's the overall procedure?  How would the *result* look different from
what I've illustrated?  If you like, you can ignore all my other replies and
focus on these questions, because it's really what I'm interested in: other
clear, viable options.

> It is clear that the path that I am advocating (that is, 
> precision-oriented knowledge entering) is more demanding for 
> information providers (they have to be analytic, precise and 
> careful when writing statements), at least until a certain amount of 
> information has been represented. But I do not see any escape from that.

I actually would say that the technical barriers even to what *I* propose
are considerable--actually teaching people to use the software and to
understand the system will be a challenge, to say the least.  To
require further that they be logicians and severely analytically-minded is
to limit the number of participants *much* more.

Philippe, I apologize for this long and vigorous reply, but I hope you will
take that as a sign of my respect for and interest in your ideas.
Furthermore, I really do believe you've raised some very interesting issues.

--Larry

====textop - a Textop (http://www.textop.org) mailing list.
To post, send a mail to textop@xxxxxxxxxxxxx, or just reply to this post.
To unsubscribe, send a mail to textop-request@xxxxxxxxxxxxx with 'unsubscribe' 
in the header.
