[edm-discuss] Re: Anyone work on web mining and feature generation?

  • From: "Joseph E. Beck" <josephbeck@xxxxxxx>
  • To: edm-discuss@xxxxxxxxxxxxx
  • Date: Tue, 12 Mar 2013 00:58:48 -0400

Wow, a very diverse set of replies.  I guess I should scope out our current
approach, and what I'm hoping exists.

We have a C# web client that we are using to download and process a web
page, so we're able to get the content OK; the problem is what to do
with it.  Our goal is to convert the content of the page into features
for predicting the page's educational efficacy.  Some features are
easy, such as counting the number of images or the number of words.
Some are harder, such as determining whether there are any movies on a
page, or the reading complexity of the text on the page.  The former is
difficult because there is no single way to embed a movie; the latter
is hard because webpages frequently lack punctuation or formal
sentences.
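
To make that concrete, here is a rough sketch of the kind of feature
extraction we're doing now.  This is illustrative only: crude regex
tag-matching stands in for a real HTML parser, and the list of movie
embedding mechanisms is surely incomplete.

    // Rough sketch of simple page features computed from raw HTML.
    // Regex tag-matching is approximate; a real HTML parser would be
    // more robust.
    using System.Text.RegularExpressions;

    static class PageFeatures
    {
        public static int CountImages(string html) =>
            Regex.Matches(html, @"<img\b", RegexOptions.IgnoreCase).Count;

        public static int CountWords(string html)
        {
            // Strip tags crudely, then count word-like tokens.
            string text = Regex.Replace(html, "<[^>]+>", " ");
            return Regex.Matches(text, @"\b\w+\b").Count;
        }

        // Movies are the hard case: <video>, <embed>, <object>, or an
        // <iframe> pointing at a hosting site are all possibilities,
        // and this list is certainly not exhaustive.
        public static bool HasMovie(string html) =>
            Regex.IsMatch(html, @"<(video|embed|object)\b",
                          RegexOptions.IgnoreCase) ||
            Regex.IsMatch(html, @"<iframe[^>]+(youtube|vimeo)",
                          RegexOptions.IgnoreCase);
    }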

What gave me cause for optimism was finding sites like wholinks2me.com,
which provide information about a page that I would not, even in principle,
know how to compute, such as frequent search terms used to find the page.
Also, Wolfram Alpha provides an interesting structural analysis of a page.

Those two tools focus on understanding the structure of a page; we were
hoping something similar existed for understanding its content, such as
text complexity, the number of movies, or how dated the technology the
page uses is (or whatever else clever folks have come up with).  I
don't know whether this would be a website that analyzes other websites
(like wholinks2me.com) or a library where someone has already
implemented such functions.
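
To illustrate why reading complexity is hard, here is a sketch of the
standard Flesch reading-ease formula (again illustrative, not our
actual code).  It leans on a sentence count, so on a page with little
punctuation the words-per-sentence term degenerates and the score
becomes meaningless.

    // Sketch of Flesch reading ease:
    //   206.835 - 1.015*(words/sentences) - 84.6*(syllables/words)
    using System;
    using System.Linq;
    using System.Text.RegularExpressions;

    static class Readability
    {
        // Crude syllable estimate: count runs of consecutive vowels.
        static int CountSyllables(string word) =>
            Math.Max(1, Regex.Matches(word.ToLower(), "[aeiouy]+").Count);

        public static double FleschReadingEase(string text)
        {
            var words = Regex.Matches(text, "[A-Za-z]+")
                             .Cast<Match>()
                             .Select(m => m.Value)
                             .ToArray();
            if (words.Length == 0) return 0.0;

            // This is the weak point on web pages: headings, menus,
            // and captions rarely end in '.', '!', or '?', so the
            // sentence count is far too low and words/sentences
            // blows up.
            int sentences = Math.Max(1, Regex.Matches(text, "[.!?]+").Count);
            int syllables = words.Sum(CountSyllables);

            return 206.835
                 - 1.015 * ((double)words.Length / sentences)
                 - 84.6 * ((double)syllables / words.Length);
        }
    }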

At present, our problem isn't massive scale.  We're only looking at 550 web
pages now, and in the near term it probably wouldn't need to go much beyond
25,000.

If the above makes it sound like we're a bit naive and starting a new
project, it's because we are :-)

joe


On Thu, Mar 7, 2013 at 7:45 AM, Nidhi Chopra <nidhi.chopra@xxxxxxxxx> wrote:

> In TTS (text-to-speech), mp3 files are opened in Visual C++ to view
> their contents after changing the file's extension.  Code can then be
> written in C/C++ to read the files and perform other operations.
> That is a summary of the six-month project I did during my Masters.
>
> Along the same lines, you could open the saved page in Notepad as
> plain text, read the contents, and look for keywords (HTML tags) that
> specify the type of file.  Then write code to do what you are
> currently doing manually with Ctrl+F.  Or have you tried this
> already?
>
> Nidhi Chopra
> Delhi, India
>
>
> On Thu, Mar 7, 2013 at 4:54 AM, Joseph E. Beck <josephbeck@xxxxxxx> wrote:
>
>> Hello, we're working on a project determining the educational efficacy of
>> webpages.  I am wondering if anyone knows of a resource for computing
>> properties of the webpage itself.  Even relatively simple-sounding
>> concepts, such as whether there is a movie, can be difficult to compute.
>> So we'd prefer to leverage someone else's work :-)  Has anyone come
>> across such tools in their work?
>>
>> Thanks.
>>
>> joe
>>
>> --
>> Joseph E. Beck
>> Assistant Professor
>> Computer Science Department, Fuller Labs 138
>> Worcester Polytechnic Institute
>>
>
>


-- 
Joseph E. Beck
Assistant Professor
Computer Science Department, Fuller Labs 138
Worcester Polytechnic Institute
