Re: Complex CONTEXT index

  • From: Nigel Thomas <nigel.cl.thomas@xxxxxxxxxxxxxx>
  • To: bill@xxxxxxxxxxxx, ORACLE-L <oracle-l@xxxxxxxxxxxxx>
  • Date: Fri, 23 Jan 2009 16:13:32 +0000

Bill

Don't you need to translate the BLOB content into indexable text before you
index it? Copying the raw bytes (or a hex rendering of them) into a CLOB is
no help; you need something that converts the encoded Word or PDF content
into real words.

   - PDF to text - there are some solutions out there, e.g. PDFBox
   <http://www.pdfbox.org/>, an OSS Java toolkit (found via Google; no idea
   if it really works).
   - Word to text - you could try e.g. Apache POI <http://poi.apache.org/>
   (same reservations, and it looks like old Word formats may be poorly
   served). Obviously this will be much easier once you get to Office Open
   XML file formats - you can just take the XML and dump the text without
   markup into your CLOB.
   - In both cases, you'd build a BLOB-to-CLOB converter using a Java stored
   proc.
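
To make the Office Open XML point a bit more concrete, here is a rough
sketch in plain Java of the "dump the text without markup" step. Treat it
as an illustration only: the names DocxTextExtractor and extractText are
made up, it assumes a .docx (a zip archive) whose body lives in the entry
word/document.xml with the visible text inside <w:t> elements, and it is
written for a current JDK. The JVM inside a 10.2 database is a good deal
older, so a stored-proc version would need adapting, and the BLOB/CLOB
plumbing is left out entirely.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper for illustration; not part of Oracle, POI or PDFBox.
class DocxTextExtractor {

    // Returns the body text of a .docx, one line per paragraph, or "" if
    // the archive has no word/document.xml entry (assumed location).
    public static String extractText(byte[] docx) {
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(docx))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if ("word/document.xml".equals(entry.getName())) {
                    return textOf(zip);
                }
            }
            return "";
        } catch (IOException | XMLStreamException e) {
            throw new IllegalArgumentException("Could not read .docx content", e);
        }
    }

    // Streams through the XML and keeps only character data found inside
    // <w:t> elements, adding a newline at the end of each paragraph.
    private static String textOf(ZipInputStream xml) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Documents are untrusted input, so refuse DTDs.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, Boolean.FALSE);
        XMLStreamReader reader = factory.createXMLStreamReader(xml);
        StringBuilder text = new StringBuilder();
        boolean inTextRun = false;
        while (reader.hasNext()) {
            switch (reader.next()) {
                case XMLStreamConstants.START_ELEMENT:
                    if ("t".equals(reader.getLocalName())) inTextRun = true;
                    break;
                case XMLStreamConstants.END_ELEMENT:
                    if ("t".equals(reader.getLocalName())) inTextRun = false;
                    else if ("p".equals(reader.getLocalName())) text.append('\n');
                    break;
                case XMLStreamConstants.CHARACTERS:
                    if (inTextRun) text.append(reader.getText());
                    break;
                default:
                    break;
            }
        }
        reader.close();
        return text.toString();
    }
}
```

Calling DocxTextExtractor.extractText(bytes) on the document bytes gives
you a string you could write into the CLOB. Headers, footnotes and anything
else outside word/document.xml are ignored, which may or may not matter for
your searches.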

Once you have indexed the text representation, you can of course discard it
(or save some/all of it for preview purposes...)

Regards Nigel

2009/1/23 Bill Zakrzewski <bill@xxxxxxxxxxxx>

> Listers -
> Oracle 10.2.0.4.0
> RH Linux
>
> I have a table (see below) on which I would like to create a
> Context/interMedia index covering the title, short_desc, long_desc and
> document (BLOB) columns. I have created a similar index on a different
> table that contained a CLOB, by concatenating all of the fields into a
> single CLOB and creating the CONTEXT index using a PL/SQL
> package/procedure (see below). I would like to do the same thing with the
> BLOB column, but I am not sure what values to pass to the
> DBMS_LOB.CONVERTTOCLOB procedure, specifically for the BLOB_CSID and
> LANG_CONTEXT parameters. My concern is that the defaults will cause it to
> copy the data in binary form rather than convert it correctly, as the
> document may be a PDF, a Word document, an Excel spreadsheet, etc. Thanks
> in advance for your help.
>
>
