Hi Joe,

Once you have the content you might look at NLTK. It's a Python library
for NLP that includes a number of tools for feature extraction that might
be useful.

Best,
Collin

On Tue, 12 Mar 2013, Joseph E. Beck wrote:

> Wow, a very diverse set of replies. I guess I should scope out our current
> approach, and what I'm hoping exists.
>
> We have a C# web client that we are using to download and process a web
> page, so we're able to get the content OK--the problem is what can we do
> with it? Our goal is to convert the content of the page into features for
> predicting a page's educational efficacy. Some features are easy, such as
> determining the number of images or number of words. Some are harder, such
> as determining whether there are any movies on a page, or the reading
> complexity of the text on the page. The former is difficult because there
> is not one way to include a movie; the latter is hard because webpages
> frequently lack punctuation or formal sentences.
>
> What gave me cause for optimism was finding sites like wholinks2me.com,
> which provide information about a page that I would not, even in
> principle, know how to compute, such as frequent search terms used to find
> the page. Also, Wolfram Alpha provides an interesting structural analysis
> of a page.
>
> Those two tools focus on understanding the structure of the page; we were
> hoping something similar existed for understanding the content on a web
> page, such as text complexity, number of movies, how old the technology
> they're using is (or whatever else clever folks have come up with). I
> don't know if this would be a website that analyzes other websites (like
> wholinks2me.com), or some libraries where someone has created such
> functions.
>
> At present, our problem isn't massive scale. We're only looking at 550 web
> pages now, and in the near term it probably wouldn't need to go much
> beyond 25,000.
> If the above makes it sound like we're a bit naive and starting a new
> project, it's because we are :-)
>
> joe
>
>
> On Thu, Mar 7, 2013 at 7:45 AM, Nidhi Chopra <nidhi.chopra@xxxxxxxxx> wrote:
>
> > In TTS (text-to-speech), MP3 files are opened in Visual C++ to view
> > their contents, after changing the file's extension. Then code can be
> > written in C/C++ to read the files and perform other operations. That is
> > the summary of the 6-month project I did in my Masters.
> >
> > Thinking along these lines, you could open the saved page as text in
> > Notepad and read the contents, looking for keywords (tags, in HTML) that
> > specify the type of file. Then write code to do what you are doing
> > manually with the Ctrl functions. Or have you tried this already?
> >
> > Nidhi Chopra
> > Delhi, India
> >
> >
> > On Thu, Mar 7, 2013 at 4:54 AM, Joseph E. Beck <josephbeck@xxxxxxx> wrote:
> >
> >> Hello, we're working on a project determining the educational efficacy
> >> of webpages. I am wondering if anyone knows of a resource for computing
> >> properties of the webpage itself. Even relatively simple-sounding
> >> concepts, such as whether there is a movie, can be difficult to
> >> compute. So we'd prefer to leverage someone else's work :-) Has anyone
> >> come across such tools in their work?
> >>
> >> Thanks.
> >>
> >> joe
> >>
> >> --
> >> Joseph E. Beck
> >> Assistant Professor
> >> Computer Science Department, Fuller Labs 138
> >> Worcester Polytechnic Institute
>
>
> --
> Joseph E. Beck
> Assistant Professor
> Computer Science Department, Fuller Labs 138
> Worcester Polytechnic Institute
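
[Editor's note: several of the features discussed above (image count, a
movie heuristic, a crude text-complexity proxy) can be computed from the
downloaded HTML with a short script. Below is a minimal sketch in Python
using only the standard library's html.parser; NLTK, as Collin suggests,
would give better tokenization and real readability measures. The list of
video-hint substrings and the average-word-length proxy are illustrative
assumptions, not standard metrics.]

```python
from html.parser import HTMLParser
import re

# Heuristic substrings suggesting an embedded movie (illustrative only;
# as Joe notes, there is no single way to include a movie on a page).
VIDEO_HINTS = ("youtube", "vimeo", ".mp4", ".mov", ".webm", ".swf")

class PageFeatures(HTMLParser):
    """Collect simple per-page features: image count, a movie
    heuristic, and visible text for a rough complexity estimate."""

    def __init__(self):
        super().__init__()
        self.image_count = 0
        self.has_movie = False
        self._skip = 0          # depth inside <script>/<style>
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self.image_count += 1
        if tag in ("video", "embed", "object", "iframe"):
            # Check several embed mechanisms and look for
            # tell-tale URLs in src/data attributes.
            d = dict(attrs)
            src = (d.get("src") or d.get("data") or "").lower()
            if tag == "video" or any(h in src for h in VIDEO_HINTS):
                self.has_movie = True
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self._text.append(data)

    def features(self):
        words = re.findall(r"[A-Za-z']+", " ".join(self._text))
        # Crude complexity proxy: mean word length. Pages often lack
        # the punctuation that sentence-based readability formulas
        # (e.g. Flesch-Kincaid) need, so this sidesteps sentences.
        avg_len = sum(map(len, words)) / len(words) if words else 0.0
        return {
            "images": self.image_count,
            "movie": self.has_movie,
            "words": len(words),
            "avg_word_length": round(avg_len, 2),
        }

p = PageFeatures()
p.feed('<p>Photosynthesis converts light.</p>'
       '<img src="leaf.png">'
       '<iframe src="https://youtube.com/embed/x"></iframe>')
print(p.features())
# → {'images': 1, 'movie': True, 'words': 3, 'avg_word_length': 9.0}
```

The same parser instance can be fed pages fetched by the C# client (or
any downloader) one at a time; at 550-25,000 pages this runs comfortably
on a single machine.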