Re: Algorithm or ideas wanted for creative text parsing

  • From: Stephane Faroult <sfaroult@xxxxxxxxxxxx>
  • To: rjamya@xxxxxxxxx
  • Date: Mon, 10 Apr 2006 19:37:15 +0200

Raj,

I did something similar at one time and didn't find anything cleverer than storing somewhere how many "segments" are significant for each ending, that is, for each value of substr(your_stuff, instr(your_stuff, '.', -1, 1) + 1).
For instance, with a .com, .net or .edu you just need the previous piece; for a .uk or a .sg you need the two previous pieces. But it would be too easy if it were that simple, because for .ca you can have big companies that are myname.ca and smaller ones that are monnom.qc.ca. Same story with .us, often (but not always) preceded by a state code, or with .fr, because you can have generic stuff (such as .gouv) preceding the ending.
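
As a rough illustration of that lookup idea, here is a minimal sketch in Python (used only to show the logic; the same table could just as well live in the database and be joined to). The table name KEEP_BEFORE_SUFFIX, the function significant_domain and every count in the table are assumptions made up for this example, not an authoritative list of endings. Two-piece endings such as qc.ca are checked first, so that exceptions win over the plain rule for the last piece.

```python
# Hypothetical rule table: ending -> number of pieces to keep in front of it.
# All values are illustrative assumptions chosen to match the examples in
# this thread; a real table would need to be maintained by hand.
KEEP_BEFORE_SUFFIX = {
    "com": 1, "net": 1, "edu": 1,   # just the previous piece
    "uk": 2, "sg": 2, "br": 2,      # the two previous pieces
    "us": 1,                        # matches the requested output (fl.us, il.us)
    "ca": 1,                        # myname.ca
    "qc.ca": 1,                     # monnom.qc.ca: exception to plain .ca
    "gouv.fr": 1,                   # generic piece in front of .fr
}


def significant_domain(fqdn, rules=KEEP_BEFORE_SUFFIX, default=1):
    """Return the significant tail of a fully qualified domain name."""
    labels = fqdn.strip().lower().rstrip(".").split(".")
    # Try the two-piece ending first so exceptions override the plain rule.
    for size in (2, 1):
        suffix = ".".join(labels[-size:])
        if len(labels) >= size and suffix in rules:
            keep = rules[suffix] + size
            return ".".join(labels[-keep:])
    # Unknown ending: fall back to one piece in front of the last one.
    return ".".join(labels[-(default + 1):])


if __name__ == "__main__":
    for name in ("www.rails4days.pwp.blueyonder.co.uk",
                 "www.celhs.osceola.k12.fl.us",
                 "www.the-simpsons.hpg.ig.com.br",
                 "ax.phobos.apple.com"):
        print(name, "->", significant_domain(name))
```

Every new exception becomes one more row in the table rather than one more branch in the statement, which is the main reason to keep the counts as data.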


Brace yourself for CASE clauses of death in your statements ...

HTH

Stéphane Faroult


rjamya wrote:

Basically I am looking to isolate just the (distinct) domain name from
fully qualified domain names that you'd normally see in web-surfing.

I am working on a couple of techniques, but it gets complicated since
TLDs differ in format and there is only so much you can do with
substr().

sample data ...

a836.v8519e.c8519.g.vm.akamaistream.net
a705.l1923962123.c19239.n.lm.akamaistream.net
db.c7.bf.a0.top.list.ru
a1657.l1923962104.c19239.n.lm.akamaistream.net
a1181.v21080b.c21080.g.vm.akamaistream.net
dl1.games.vip.scd.yahoo.com
lcp.mud.us.music.yahoo.com
www.celhs.osceola.k12.fl.us
www.celhs.osceola.k12.fl.us
www.celhs.osceola.k12.fl.us
w.s0.gc.sj.ipixmedia.com
w.s0.gc.sj.ipixmedia.com
v.s0.gc.sj.ipixmedia.com
us.1.p6.webhosting.yahoo.com
p1.music.vip.sc5.yahoo.com
lib1.store.vip.sc5.yahoo.com
www.twingroves.district96.k12.il.us
www.twingroves.district96.k12.il.us
www.the-simpsons.hpg.ig.com.br
www.schools.pinellas.k12.fl.us
www.rails4days.pwp.blueyonder.co.uk
www.rails4days.pwp.blueyonder.co.uk
www.garrp.dhr.state.ga.us
www.celhs.osceola.k12.fl.us
www.williamrobertson.pwp.blueyonder.co.uk
www.williamrobertson.pwp.blueyonder.co.uk
lcp.mud.us.music.yahoo.com
c.s0.gc.sj.ipixmedia.com
c.s0.gc.sj.ipixmedia.com
ax.phobos.apple.com
ax.phobos.apple.com
0982660.1206.feed.yellowpagecity.com
0982660.1207.feed.yellowpagecity.com

and by some magic the output should be ....

akamaistream.net
apple.com
yahoo.com
fl.us
ipixmedia.com
il.us
ig.com.br
blueyonder.co.uk
ga.us
yellowpagecity.com

Any ideas, thoughts?  I'd prefer to do this in SQL if possible, else
I'd prefer plsql. The data is already in a 10.1.0.4 database.

Thanks in advance
Raj
----------------------------------------------
Got RAC?
--
//www.freelists.org/webpage/oracle-l






