[edm-discuss] Re: Anyone work on web mining and feature generation? (fwd)

  • From: Collin Lynch <collinl@xxxxxxxxxxx>
  • To: edm-discuss@xxxxxxxxxxxxx
  • Date: Thu, 14 Mar 2013 12:40:47 -0400 (EDT)

Hi Joe,  Once you have the content you might look at NLTK.  It's a
Python library for NLP that includes a number of tools for feature
extraction that might be useful.

        Best,
        Collin.
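[Editor's note: for the text-side features discussed below (word counts, reading complexity), here is a minimal sketch of the kind of thing NLTK helps with, written in plain Python so it is self-contained. The vowel-group syllable counter and the simplistic tokenization are deliberate approximations; NLTK's tokenizers would be more robust, and the Flesch constants are the standard published ones.]

```python
import re

def text_features(text):
    """Crude text features: word count, sentence count,
    and a Flesch reading-ease approximation."""
    words = re.findall(r"[A-Za-z']+", text)
    # Webpages often lack formal punctuation, so fall back to
    # treating the whole text as one sentence.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()] or [text]

    def syllables(word):
        # Count vowel groups as a rough syllable estimate.
        return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

    n_words = len(words) or 1
    n_sents = len(sentences)
    n_syll = sum(syllables(w) for w in words)
    # Flesch reading ease: higher scores mean easier text.
    flesch = 206.835 - 1.015 * (n_words / n_sents) - 84.6 * (n_syll / n_words)
    return {"words": n_words, "sentences": n_sents, "flesch": round(flesch, 1)}
```

The fallback for punctuation-free pages matters here: without it, a page of fragments would divide by zero sentences.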

On Tue, 12 Mar 2013, Joseph E. Beck wrote:

> Wow, a very diverse set of replies.  I guess I should scope out our current
> approach, and what I'm hoping exists.
>
> We have a C# web client that we are using to download and process a web
> page, so we're able to get the content ok--the problem is what can we do
> with it?  Our goal is to convert the content of the page into features for
> predicting a page's educational efficacy.  Some features are easy, such as
> determining the number of images or number of words.  Some are harder, such
> as determining whether there are any movies on a page, or the reading
> complexity of the text on the page.  The former is difficult because there
> is not one way to include a movie; the latter is hard because webpages
> frequently lack punctuation or formal sentences.
>
> What gave me cause for optimism was finding sites like wholinks2me.com,
> which provide information about a page that I would not, even in principle,
> know how to compute, such as frequent search terms used to find the page.
>  Also, Wolfram Alpha provides an interesting structural analysis of a page.
>
>
> Those two tools focus on understanding the structure of the page; we were
> hoping something similar existed for understanding the content on a web
> page, such as text complexity, number of movies, how old the technology
> they're using is (or whatever else clever folks have come up with).  I
> don't know if this would be a website that analyzes other websites (like
> wholinks2me.com), or some libraries where someone has created such
> functions.
>
> At present, our problem isn't massive scale.  We're only looking at 550 web
> pages now, and in the near term it probably wouldn't need to go much beyond
> 25,000.
>
> If the above sounds like we're a bit naive and starting a new project, it's
> because we are :-)
>
> joe
>
>
> On Thu, Mar 7, 2013 at 7:45 AM, Nidhi Chopra <nidhi.chopra@xxxxxxxxx> wrote:
>
> > In TTS (text-to-speech), mp3 files are opened in Visual C++ to view their
> > contents after changing the file's extension. Code can then be
> > written in C/C++ to read the files & perform other operations. That is the
> > summary of a 6-month project I did in my Masters.
> >
> > Thinking along these lines, you could open the saved page in Notepad as text,
> > read the contents, and look for keywords (tags, in HTML) that specify the type
> > of file. Then write code to do what you are doing manually with the Ctrl+F
> > function. Or have you tried this already?
> >
> > Nidhi Chopra
> > Delhi, India
> >
> >
> > On Thu, Mar 7, 2013 at 4:54 AM, Joseph E. Beck <josephbeck@xxxxxxx> wrote:
> >
> >> Hello, we're working on a project determining the educational efficacy of
> >> webpages.  I am wondering if anyone knows of a resource for computing
> >> properties of the webpage itself.  Even relatively simple-sounding
> >> concepts, such as whether there is a movie, can be difficult to compute.
> >>  So we'd prefer to leverage off of someone else's work :-)   Has anyone
> >> come across such tools in their work?
> >>
> >> Thanks.
> >>
> >> joe
> >>
> >> --
> >> Joseph E. Beck
> >> Assistant Professor
> >> Computer Science Department, Fuller Labs 138
> >> Worcester Polytechnic Institute
> >>
> >
> >
>
>
> --
> Joseph E. Beck
> Assistant Professor
> Computer Science Department, Fuller Labs 138
> Worcester Polytechnic Institute
>
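[Editor's note: since several replies suggest scanning the saved HTML for tags, here is a minimal sketch of Joe's "number of images" and "is there a movie" features using Python's standard-library parser. The set of movie-bearing tags is an assumption — <video>, <embed>, <object>, and <iframe> cover common cases, but as Joe notes there is no single way to include a movie, so this can only be a heuristic.]

```python
from html.parser import HTMLParser

# Tags that *may* carry an embedded movie. This list is a guess, not
# exhaustive: Flash <embed>/<object>, <iframe> players, HTML5 <video>.
MOVIE_TAGS = {"video", "embed", "object", "iframe"}

class FeatureCounter(HTMLParser):
    """Counts images and possible movie embeds in an HTML page."""
    def __init__(self):
        super().__init__()
        self.images = 0
        self.movie_tags = 0

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self.images += 1
        elif tag in MOVIE_TAGS:
            self.movie_tags += 1

def page_features(html):
    parser = FeatureCounter()
    parser.feed(html)
    return {"images": parser.images, "maybe_has_movie": parser.movie_tags > 0}
```

At the scale mentioned (hundreds to tens of thousands of pages), a single-threaded pass with this parser is more than fast enough; the harder problem remains deciding which tags actually signal a movie.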



