Re: Algorithm or ideas wanted for creative text parsing

  • From: rjamya <rjamya@xxxxxxxxx>
  • To: "Gus Spier" <gspier@xxxxxxxxxxx>
  • Date: Mon, 10 Apr 2006 15:46:19 -0400

So far I have done ....

CASE
WHEN :new.domain_name LIKE '%.imageshack.us' THEN 'imageshack.us'    
-- we need to collapse these
WHEN :new.domain_name LIKE '%.adtexh.de'     THEN 'adtech.de'     --
we need to collapse these
WHEN domain_name LIKE '%.echo.cx' THEN 'echo.cx'     -- we need to
collapse these
WHEN domain_name LIKE '%.exs.cx' THEN 'exs.cx'     -- we need to collapse these
WHEN domain_name LIKE '%.bigoo.ws' THEN 'bigoo.ws'     -- we need to
collapse these
-- check if the domain name is just numeric
WHEN 
regexp_like(trim(domain_name),'[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}$')
THEN domain_name
-- if it is .com/.name/.org etc get last two pieces
WHEN length(szlastpiece) > 2 THEN substrdomain_name,pos1+1)
-- if last piece is a country tld, then get the last 3 pieces
ELSE substr(domain_name,pos2+1)
END

szLast piece is the piece after the last '.' in the domain name and
pos1, pos2 refer to positions of '.' (second from last and 3rd from
last respectively).

Any gotcha's, missed rules are welcome. The results so far are pretty
good ... a sample query on 4m+ rows so far shows reliable output.
there will always be caveats, but I am happy with 99% "hit ratio" ...
though any improvements are always welcome.

uh oh ... I said the H word ... I am marked now ...
Raj

On 4/10/06, Gus Spier <gspier@xxxxxxxxxxx> wrote:
>  Raj,
>  It looks to me like you're going to have to do some rule based ETL. Start
> by parsing your URIs on the dots into varrays and then examinining data ....
> if seg. first == 'www' and seg.last == "com" then harvest seg.length-1 ...
>
>  if seg.last='uk' and seg.length-1='co' then harvest seg.length-2 ...
>
>  et cetera ad endless nauseam.
>
>  But I don't think you can build a script that will reliably trundle out
> there and correctly get what you want first try.
>
>  Good luck
>  Gus
--
----------------------------------------------
Got RAC?
--
//www.freelists.org/webpage/oracle-l


Other related posts: