Re: Algorithm or ideas wanted for creative text parsing

  • From: rjamya <rjamya@xxxxxxxxx>
  • To: "Stephane Faroult" <sfaroult@xxxxxxxxxxxx>
  • Date: Mon, 10 Apr 2006 13:34:43 -0400

Thanks SF and all

maybe here is what I can do ...

1. if the domain is numeric, take it as it is
2. if the TLD (i.e. the last piece) is 3 or more characters, take the
last 2 pieces
    (this covers com, org, edu, name, info, museum etc.)
3. if the last piece is 2 characters (most likely a ccTLD), take the last 3 pieces
    (e.g. il, br, ca, uk)
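
The three rules above could be sketched roughly as follows. This is only an illustration in Python, not tested against real data; the function name `site_domain` and the regular expression used to decide what counts as "numeric" are assumptions, not something from the thread.

```python
import re


def site_domain(host: str) -> str:
    """Apply the three rules to a host name (a heuristic, not a full solution)."""
    host = host.strip().rstrip(".").lower()

    # Rule 1: a purely numeric, dotted host is taken as it is.
    # (Assumption: "numeric" means digits separated by dots.)
    if re.fullmatch(r"\d+(\.\d+)*", host):
        return host

    pieces = host.split(".")

    # Rule 3: a 2-character last piece (most likely a ccTLD) keeps 3 pieces.
    # Rule 2: anything else keeps the last 2 pieces.
    keep = 3 if len(pieces[-1]) == 2 else 2
    return ".".join(pieces[-keep:])


if __name__ == "__main__":
    for h in ("www.example.com", "mail.example.co.uk", "192.168.1.10"):
        print(h, "->", site_domain(h))
```

One thing the sketch makes visible: a name registered directly under a ccTLD, such as www.example.ca, comes back unchanged as www.example.ca, because rule 3 always keeps three pieces.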

hmmm ... looks promising, am I missing anything?

On 4/10/06, Stephane Faroult <sfaroult@xxxxxxxxxxxx> wrote:
> Raj,
>     I did something similar at one time and didn't find anything
> cleverer than storing somewhere how many "segments" are significant for
> one given substr(your_stuff, instr(your_stuff, '.', -1, 1) + 1).
> For instance, with a .com, .net or .edu you just need the previous
> piece, for a .uk or a .sg you need the two previous pieces. But it would
> be too easy if it were that simple, because for .ca you can have big
> companies that are [...] or smaller ones that are [...]. Same
> story with .us, often (but not always) preceded by a state code, or with
> .fr because you can have generic stuff (such as .gouv) preceding the
> termination.
> Brace yourself for CASE clauses of death in your statements ...
> Stéphane Faroult
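
For comparison, the "store how many segments are significant" idea from the quoted reply might look something like the sketch below. It is a Python illustration only; the table name `PIECES_BEFORE_TLD`, the helper names, and the default of one piece for unknown endings are all hypothetical. The table holds just the endings mentioned in the quoted message, and it does not handle the .ca, .us and .fr cases, where the count is not fixed per ending.

```python
# Hypothetical lookup table: how many pieces before the TLD are significant.
# Only the endings named in the quoted message are listed.
PIECES_BEFORE_TLD = {"com": 1, "net": 1, "edu": 1, "uk": 2, "sg": 2}


def tld_of(host: str) -> str:
    # Text after the last dot, the same idea as
    # substr(your_stuff, instr(your_stuff, '.', -1, 1) + 1).
    return host.rpartition(".")[2]


def significant_part(host: str, default: int = 1) -> str:
    """Keep the TLD plus the number of preceding pieces stored for it."""
    host = host.lower()
    pieces = host.split(".")
    # Unknown endings fall back to `default` pieces (an assumption).
    keep = PIECES_BEFORE_TLD.get(tld_of(host), default) + 1
    return ".".join(pieces[-keep:])


if __name__ == "__main__":
    for h in ("www.example.com", "www.example.co.uk"):
        print(h, "->", significant_part(h))
```

Keeping the counts in a table (or a database table joined on the extracted ending) keeps the exceptions in data rather than in CASE clauses.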
Got RAC?
