Re: Algorithm or ideas wanted for creative text parsing

  • From: "Giovanni Cuccu" <giovanni.cuccu@xxxxxxxxx>
  • To: rjamya@xxxxxxxxx
  • Date: Tue, 11 Apr 2006 09:17:39 +0200

Hi Raj,
   If the exception list grows I think you could improve your
procedure putting the domain exceptions in a table and looping on that
table to find the domain you need.
the domain table could have this structure:
create table DOMAINS (
   LIKE_PATTERN varchar2(102),
   DOMAIN_NAME varchar2(100),
   DOMAIN_ORDER number
);
with a function like
create or replace function getdomain(full_domain in varchar2) return varchar2 is
begin
  for domain_match in (select * from DOMAINS order by DOMAIN_ORDER)
  loop
     if full_domain LIKE domain_match.like_pattern then
        return domain_match.domain_name;
     end if;
  end loop;
  return null;
end getdomain;

you could manage your exception and even try the more probable ones
first (which could lead to less resources usage).This is not
production code but I think it may give you an idea of my suggestion.
Hope it helps,
   Giovanni

On 4/10/06, rjamya <rjamya@xxxxxxxxx> wrote:
> So far I have done ....
>
> CASE
> WHEN :new.domain_name LIKE '%.imageshack.us' THEN 'imageshack.us'
> -- we need to collapse these
> WHEN :new.domain_name LIKE '%.adtexh.de'     THEN 'adtech.de'     --
> we need to collapse these
> WHEN domain_name LIKE '%.echo.cx' THEN 'echo.cx'     -- we need to
> collapse these
> WHEN domain_name LIKE '%.exs.cx' THEN 'exs.cx'     -- we need to collapse 
> these
> WHEN domain_name LIKE '%.bigoo.ws' THEN 'bigoo.ws'     -- we need to
> collapse these
> -- check if the domain name is just numeric
> WHEN 
> regexp_like(trim(domain_name),'[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}$')
> THEN domain_name
> -- if it is .com/.name/.org etc get last two pieces
> WHEN length(szlastpiece) > 2 THEN substrdomain_name,pos1+1)
> -- if last piece is a country tld, then get the last 3 pieces
> ELSE substr(domain_name,pos2+1)
> END
>
> szLast piece is the piece after the last '.' in the domain name and
> pos1, pos2 refer to positions of '.' (second from last and 3rd from
> last respectively).
>
> Any gotcha's, missed rules are welcome. The results so far are pretty
> good ... a sample query on 4m+ rows so far shows reliable output.
> there will always be caveats, but I am happy with 99% "hit ratio" ...
> though any improvements are always welcome.
>
> uh oh ... I said the H word ... I am marked now ...
> Raj
>
> On 4/10/06, Gus Spier <gspier@xxxxxxxxxxx> wrote:
> >  Raj,
> >  It looks to me like you're going to have to do some rule based ETL. Start
> > by parsing your URIs on the dots into varrays and then examinining data ....
> > if seg. first == 'www' and seg.last == "com" then harvest seg.length-1 ...
> >
> >  if seg.last='uk' and seg.length-1='co' then harvest seg.length-2 ...
> >
> >  et cetera ad endless nauseam.
> >
> >  But I don't think you can build a script that will reliably trundle out
> > there and correctly get what you want first try.
> >
> >  Good luck
> >  Gus
> --
> ----------------------------------------------
> Got RAC?
> --
> //www.freelists.org/webpage/oracle-l
>
>
>


--
--------------------------------------------------------------------
Another free oracle resource profiler
http://sourceforge.net/projects/oraresprof/
Now version 0.9
--
//www.freelists.org/webpage/oracle-l


Other related posts: