[mt] word clustering

  • From: "Mandel Shi" <mandel@xxxxxxxx>
  • To: "mt" <mt@xxxxxxxxxxxxx>
  • Date: Fri, 1 Oct 2004 17:11:09 +0800

Dear mt subscribers,

About the two programs for word clustering:

The maximization equation in "Algorithms for bigram and trigram word clustering"
(Sven Martin, Jorg Liermann, Hermann Ney) is ok, but the algorithm description
is vague in some aspects and uses perplexity as its criterion.

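It is safe to say that log-likelihood should be maximized and perplexity should
be minimized. In fact the two criteria are equivalent, because perplexity is the
exponential of the negative average log-likelihood per word.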
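
To make the equivalence concrete, here is a minimal Python sketch. The function
name and the two log-likelihood values are made up for illustration and are not
taken from the attached programs. It only shows that, on a fixed number of
predicted tokens, the higher log-likelihood always gives the lower perplexity.

```python
import math

def perplexity(log_likelihood, n_tokens):
    """Perplexity from a natural-log likelihood summed over n_tokens predictions."""
    return math.exp(-log_likelihood / n_tokens)

# Two hypothetical clusterings scored on the same 6 predicted tokens.
ll_a = 2 * math.log(0.5)   # higher (better) log-likelihood
ll_b = 6 * math.log(0.5)   # lower (worse) log-likelihood

pp_a = perplexity(ll_a, 6)
pp_b = perplexity(ll_b, 6)

# exp(-x / N) is strictly decreasing in x, so the ranking is simply reversed.
print(ll_a > ll_b, pp_a < pp_b)
```

Since the mapping is monotone, an exchange step that raises the log-likelihood
is accepted under either criterion.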

But my implementation is buggy and I have not found the bug myself. So I am
resending the modified program (but don't dwell on the maximization criterion;
it is correct now, and something is wrong with the count-updating process). You
can test the result using the following small corpus, small.txt:

w1 w2 w3
w1 w5 w3

If we want 5 classes, they should be OOV, <s>, {w1}, {w2,w5}, {w3}. Running
cluster1 -c 5 small.txt gives this result.
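
The expected result can be checked by brute force. The sketch below is not the
cluster1 code. It assumes a class bigram model p(w|v) = p(w|class(w)) *
p(class(w)|class(v)) with relative-frequency estimates, a <s> token prepended to
every sentence and kept in its own class, and an empty OOV class. All function
names are hypothetical. With the four words in three classes, exactly one pair
of words must share a class, so it scores all six possible pairs.

```python
import math
from collections import Counter
from itertools import combinations

SMALL = [["w1", "w2", "w3"], ["w1", "w5", "w3"]]

def bigrams(sentences):
    """List (predecessor, word) pairs, with <s> prepended to each sentence."""
    pairs = []
    for sentence in sentences:
        tokens = ["<s>"] + sentence
        pairs.extend(zip(tokens, tokens[1:]))
    return pairs

def class_bigram_log_likelihood(sentences, word2class):
    """Log-likelihood of the corpus under the class bigram model (assumed form)."""
    pairs = bigrams(sentences)
    succ_word = Counter(w for _, w in pairs)
    succ_class = Counter(word2class[w] for _, w in pairs)
    pred_class = Counter(word2class[v] for v, _ in pairs)
    class_pair = Counter((word2class[v], word2class[w]) for v, w in pairs)
    ll = 0.0
    for v, w in pairs:
        cv, cw = word2class[v], word2class[w]
        p_transition = class_pair[cv, cw] / pred_class[cv]
        p_membership = succ_word[w] / succ_class[cw]
        ll += math.log(p_transition * p_membership)
    return ll

def score_all_pair_merges(sentences):
    """Score every clustering in which exactly one pair of words shares a class."""
    words = sorted({w for s in sentences for w in s})
    scores = {}
    for pair in combinations(words, 2):
        word2class = {"<s>": "<s>"}
        for w in words:
            word2class[w] = "+".join(pair) if w in pair else w
        scores[pair] = class_bigram_log_likelihood(sentences, word2class)
    return scores

scores = score_all_pair_merges(SMALL)
best = max(scores, key=scores.get)
print(best, round(scores[best], 4))  # ('w2', 'w5') -1.3863
```

Under these assumptions putting w2 and w5 together scores 2*log(0.5), and every
other pairing scores lower, which agrees with the classes listed above.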

I also implemented a version of the Brown 1992 algorithm. It is a great paper.
This implementation, I believe, is free of bugs, but it is very slow. You can
test these two programs on another Chinese corpus using

 cluster1 -c 15 chinese.txt
 cluster2 -c 15 chinese.txt

The Brown algorithm gives accurate results, but for a larger corpus my
straightforward implementation is painfully slow.
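
To show where the time goes, here is a rough Python sketch of a naive greedy
merge in the spirit of Brown et al. 1992. It is not the cluster2 code. It
assumes every word starts in its own class, <s> is never merged, and each step
keeps the merge that leaves the highest average mutual information between
adjacent classes. All names are hypothetical. Every step tries all pairs of
classes and recounts the whole corpus for each one, which is why a
straightforward version gets so slow as the vocabulary grows.

```python
import math
from collections import Counter
from itertools import combinations

def average_mutual_information(pairs, word2class):
    """Mutual information between the classes of adjacent tokens."""
    n = len(pairs)
    joint = Counter((word2class[v], word2class[w]) for v, w in pairs)
    left = Counter(word2class[v] for v, _ in pairs)
    right = Counter(word2class[w] for _, w in pairs)
    return sum(
        (count / n) * math.log(count * n / (left[a] * right[b]))
        for (a, b), count in joint.items()
    )

def greedy_merge(sentences, n_word_classes):
    """Merge word classes greedily until n_word_classes remain (<s> excluded)."""
    pairs = []
    for sentence in sentences:
        tokens = ["<s>"] + sentence
        pairs.extend(zip(tokens, tokens[1:]))
    words = sorted({w for s in sentences for w in s})
    classes = [frozenset([w]) for w in words]
    while len(classes) > n_word_classes:
        best_score, best_classes = -math.inf, None
        # Naive part: every candidate merge triggers a full recount.
        for a, b in combinations(classes, 2):
            candidate = [c for c in classes if c not in (a, b)] + [a | b]
            word2class = {"<s>": "<s>"}
            for c in candidate:
                for w in c:
                    word2class[w] = c
            score = average_mutual_information(pairs, word2class)
            if score > best_score:
                best_score, best_classes = score, candidate
        classes = best_classes
    return classes

small = [["w1", "w2", "w3"], ["w1", "w5", "w3"]]
result = sorted(sorted(c) for c in greedy_merge(small, 3))
print(result)  # [['w1'], ['w2', 'w5'], ['w3']]
```

The Brown paper avoids the full recount by keeping a table of the loss for each
candidate merge and updating it incrementally. That bookkeeping is left out
here.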

all the relevant files can be downloaded from

happy holiday!

Best Regards

Professor Mandel Shi
Department of Computer Science
Xiamen University
361005, Xiamen Fujian, China
Tel: +86-592-218-8355

