[pcductape] Re: How does Google search?

  • From: Scott McNay <Wizard@xxxxxxxx>
  • To: Victor Firestone <pcductape@xxxxxxxxxxxxx>
  • Date: Thu, 28 Aug 2003 19:30:58 -0500

Hi Victor,

Thursday, August 28, 2003, 3:02:48 PM, you wrote:

VF> The only webpages that Google will never find are company internal
VF> Intranet webpages - all the rest eventually will end up in Googles
VF> database. [Which, thinking about it must be huge / gigantic. I
VF> wonder what they run their database on and what type of database
VF> server - I am sure it must be at least a 4 processor machine].

I guess you haven't seen this page before:

http://www.google.com/technology/pigeonrank.html

It's obvious humor, but it does tell you how Google works behind the
scenes (replace "PigeonRank" with "PageRank", etc.).
Looks like they're running Linux (presumably with Beowulf clustering
software) on standard rack-mounted computers (apparently they don't
use backplanes or blades).

The fact that they keep a cache of all of the text on each web page
means that they need huge amounts of storage space just to store
that, even if stored compressed.

My experiments with high-speed full-text indexing were interesting;
you can get impressive speed as long as you don't mind huge indices.
On the other hand, if you're indexing huge quantities of data, the
size becomes more reasonable. My experiment stored blocks of 8 (or
whatever it was) characters of text along with a pointer to which
record(s) that text could be found in.  For example, this paragraph
would be stored as:

My exper  Msg 2177 line 25, Msg 2177 line 28
y experi  Msg 2177 line 25, Msg 2177 line 28
 experim  Msg 2177 line 25, Msg 2177 line 28
experime  Msg 2177 line 25, Msg 2177 line 28
xperimen  Msg 2177 line 25, Msg 2177 line 28
periment  Msg 2177 line 25, Msg 2177 line 28
eriments  Msg 2177 line 25
riments   Msg 2177 line 25
iments w  Msg 2177 line 25

etc.  I used a standard index for the blocks themselves.  The space
could be optimized greatly by using LZW compression.
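
As a rough illustration of the scheme described above, here is a minimal
Python sketch. It is not the code from the original experiment: the block
width of 8, the (message, line) tuples used as record pointers, and the
names build_block_index and lookup are all assumptions made for the
example, and a plain dict stands in for the "standard index".

```python
from collections import defaultdict

BLOCK = 8  # assumed block width; the post only says "8 (or whatever it was)"


def build_block_index(records, block=BLOCK):
    """Map every `block`-character window to the records that contain it.

    `records` is a dict of record id -> text. The ids used below are
    hypothetical (message number, line number) tuples.
    """
    index = defaultdict(set)
    for rec_id, text in records.items():
        # Slide the window one character at a time across the text.
        for i in range(len(text) - block + 1):
            index[text[i:i + block]].add(rec_id)
    return {blk: sorted(ids) for blk, ids in index.items()}


def lookup(index, query, block=BLOCK):
    """Return candidate records, keyed on the first `block` characters.

    The result is only a candidate list: the caller still has to check
    the full query against each record's text.
    """
    if len(query) < block:
        raise ValueError("query is shorter than the block width")
    return index.get(query[:block], [])


records = {
    (2177, 25): "My experiments with high-speed full-text indexing were interesting;",
    (2177, 28): "size becomes more reasonable. My experiment stored blocks of 8 (or",
}
index = build_block_index(records)
print(index["My exper"])   # found on both lines
print(index["eriments"])   # found on line 25 only
```

Note that the index gets roughly one entry per character of input text,
which is where the "huge indices" come from; repeated blocks share an
entry, so the overhead shrinks as the amount of indexed text grows.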

--Scott.

