The way the data is indexed also matters. I would love to work on this
project. One warning, though: I worked on a Tamil corpus before, and it
went nowhere. It would be good to have active participation from everyone
involved. Funds should also be collected for this project because it could
More than big data, what matters is the way things are indexed. Let's say
one server responds to words beginning with one Tamil letter and another
responds only to words beginning with a different letter; this kind of
shared responsibility will make things fast. As we progress, we could add
another server that handles Soundex and spelling suggestions. As the project
moves on, we must look at what is being searched for and build our
infrastructure around it.
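
To make the shared-responsibility idea concrete, here is a minimal sketch,
assuming each "server" is simply an in-memory shard keyed by the first
character of a word. The class name ShardedIndex and its methods are
hypothetical, invented for illustration; a real deployment would put each
shard on its own machine.

```python
from collections import defaultdict


class ShardedIndex:
    """Toy inverted index partitioned by the first character of each word.

    Assumption: the first Unicode code point of a word is used as the
    shard key. For Tamil text this is the leading vowel or consonant.
    """

    def __init__(self):
        # shard key -> word -> set of (doc_id, line_no) postings
        self.shards = defaultdict(lambda: defaultdict(set))

    def add_line(self, doc_id, line_no, text):
        # str.split() never yields empty strings, so word[0] is safe.
        for word in text.split():
            self.shards[word[0]][word].add((doc_id, line_no))

    def lookup(self, word):
        if not word:
            return []
        # Only the shard that owns the word's first character is consulted.
        shard = self.shards.get(word[0], {})
        return sorted(shard.get(word, set()))
```

A lookup touches one shard only, so the work per query shrinks roughly in
proportion to the number of shards, provided words are spread evenly
across first letters.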
On Sat, Oct 8, 2016 at 2:37 AM, Shrinivasan T <tshrinivasan@xxxxxxxxx> wrote:
Many Tamil scholars are looking for a search engine for Tamil literature.
They often look for the following things.
1. Search for any word across all the literature, highlighting the line
where it occurs and, if possible, one line above and below.
2. Frequency of any given word.
3. The most used and least used words of any given author.
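
The three requests above can be sketched on a small scale as follows. This
is only an illustration, assuming one work (or one author's collected text)
is already loaded as a list of lines and that words are separated by
whitespace. The function names are hypothetical.

```python
from collections import Counter


def search_with_context(lines, word, context=1):
    """Yield (line_no, snippet) for every line containing `word` as a token.

    The snippet holds the matching line plus up to `context` lines
    above and below it. Line numbers are 1-based.
    """
    for i, line in enumerate(lines):
        if word in line.split():
            start = max(0, i - context)
            end = min(len(lines), i + context + 1)
            yield i + 1, lines[start:end]


def word_frequencies(lines):
    """Count how often each whitespace-separated word occurs."""
    return Counter(w for line in lines for w in line.split())


def most_and_least_used(lines, n=3):
    """Return the n most frequent and n least frequent (word, count) pairs."""
    ranked = word_frequencies(lines).most_common()
    return ranked[:n], ranked[-n:][::-1]
```

Running most_and_least_used over the lines of a single author gives the
per-author view; whitespace splitting is a simplification, and real Tamil
text would likely need proper tokenization and handling of punctuation.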
The literature is available in text format here.
There are people who scrape Tamil websites regularly.
They have around 180 GB of Tamil in plain text format.
When they grep for any word, it takes 8-10 hours on a normal desktop.
I think we can use big data tools for them.
Can we use Elasticsearch/Druid for this purpose?
How do we import the plain text into these tools?
Share your thoughts on this.
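
On the import question, one possible approach for Elasticsearch is to turn
each line of text into a small JSON document and send batches to the _bulk
endpoint, which accepts newline-delimited JSON (an action line followed by
a source line). The sketch below only builds that request body; the index
name "tamil-lines" and the field names work, line_no and text are
assumptions made for illustration, not an agreed schema.

```python
import json


def bulk_actions(doc_id, lines, index_name="tamil-lines"):
    """Yield NDJSON rows for the _bulk endpoint, one document per text line.

    Blank lines are skipped, but line numbers still refer to the
    position in the original file.
    """
    for line_no, text in enumerate(lines, start=1):
        text = text.strip()
        if not text:
            continue
        # Action row: index this document under a stable, repeatable id.
        yield json.dumps(
            {"index": {"_index": index_name, "_id": f"{doc_id}:{line_no}"}}
        )
        # Source row: ensure_ascii=False keeps Tamil text readable as UTF-8.
        yield json.dumps(
            {"work": doc_id, "line_no": line_no, "text": text},
            ensure_ascii=False,
        )


def bulk_body(doc_id, lines, index_name="tamil-lines"):
    """Join the rows into one request body; _bulk needs a trailing newline."""
    return "\n".join(bulk_actions(doc_id, lines, index_name)) + "\n"
```

The resulting body would be sent with the Content-Type application/x-ndjson.
Indexing one document per line keeps the line number available, so the
search side can fetch the neighbouring lines for context.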
My Life with GNU/Linux : http://goinggnu.wordpress.com
Free E-Magazine on Free Open Source Software in Tamil : http://kaniyam.com
Get Free Tamil Ebooks for Android, iOS, Kindle, Computer :