Greetings,
On Wed, Feb 3, 2010 at 9:30 PM, Santhosh Thottingal
<santhosh00@xxxxxxxxx> wrote:
On Tue, Feb 2, 2010 at 5:16 PM, Rajagopal Swaminathan
<raju.rajsand@xxxxxxxxx> wrote:
If your intention is to get a list of words from a file containing
Tamil Unicode data, try this:
perl -e 'binmode(STDIN, ":utf8"); binmode(STDOUT, ":utf8");
  while ( defined(my $c = getc(STDIN)) ) {
    if ( $c =~ /[\x{0b80}-\x{0bff}\s]/ ) { print $c }
  }' < infile.txt |
perl -ne 'my @fields = split();
  foreach my $f ( @fields ) { print $f, "\n" }' | sort -u
0b80-0bff is the Tamil Unicode range. Use gucharmap or kcharselect to
find out the ranges.
infile.txt is your input file.
The script above also sorts the words and removes duplicates.