[recoll-user] Re: Indexing python Files

  • From: Linos <info@xxxxxxxx>
  • To: recoll-user@xxxxxxxxxxxxx
  • Date: Sun, 14 Sep 2008 14:44:49 +0200

Hi Jean-Francois,
To convert it to nicely colored HTML output I am using the script I found at http://chrisarndt.de/en/software/python/colorize.html; it is pure Python. I found some other tricks to do the same with vim or enscript, but I suppose the one program you can be nearly sure is installed on a machine that wants to index Python files is Python itself. I renamed the file to rclpython in the ~/.recoll directory, and now the settings in mimemap and mimeconf work well: I can preview and edit the .py files, and the contents are correctly indexed. I am using the icon from the Oxygen KDE package, which I copied to /usr/share/recoll/images, and I am using these configs in ~/.recoll:

mimeconf:
[index]
text/x-python = exec rclpython

[icons]
text/x-python = text-x-python

[categories]
text = \
      text/x-python

mimemap:
.py = text/x-python

mimeview:
[view]
text/x-python = idle %f

I am using idle as the editor because it is the only editor that ships with Python. You can find the icon and the rclpython script in the attached files. The icon set and the conversion script are under the GPL, so you should have no problems including them in the official distribution. Thanks for the hint.

Regards,
Miguel Angel.


Jean-Francois Dockes wrote:
Hi Linos,

You can't just say "text/x-python = internal"; this would suppose that the
C++ code knows what to do with text/x-python, which it doesn't.

You have 2 possibilities:
1- Either add something like the following to mimemap:

.py = text/plain

 Then python files will just be indexed as text, but you lose the ability
 to have a specific viewer/icons etc.


2- Or add ".py = text/x-python" to mimemap, but then you need to add a
   filter script for python files. Add something like the following to
mimeconf:
text/x-python = exec rclpython

 - The rclpython script (which might be written in python by the way...)
   would need to turn the python program into html. Minimally, this means
   emitting a charset meta tag, and just emitting the python text after
   escaping characters like "<", "&". For a simple example, take a look at
   rclman. In fact, I should write a script that would generically do this
   to any text file, maybe I'll do it for the next release.
   Alternatively, maybe someone already wrote a program to turn a python
   program into nice html, then you could just call this from the script.
   The regular rcl... scripts also do other stuff like checking for external
   programs and emitting specific errors etc., but this is not strictly
   needed; you just need to spit out HTML.
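
As a rough illustration of the minimal filter described above, here is a sketch in Python 3. It is not the rclman script and not part of recoll. It assumes the source file is UTF-8 and that a filter script would be handed the file name as its first argument, as the attached script expects; the names text_to_html and convert_file are made up for this example.

```python
import html
import os
import sys


def text_to_html(source, title="", charset="utf-8"):
    """Wrap plain text in a minimal HTML page with a charset meta tag.

    html.escape() takes care of "<", ">" and "&" (and quotes).
    """
    return (
        "<html><head>\n"
        "<title>%s</title>\n"
        '<meta http-equiv="Content-Type" content="text/html; charset=%s">\n'
        "</head><body>\n"
        "<pre>%s</pre>\n"
        "</body></html>\n"
    ) % (html.escape(title), charset, html.escape(source))


def convert_file(path, out=sys.stdout):
    """Read a text file (assumed UTF-8) and write the HTML page to out."""
    with open(path, encoding="utf-8", errors="replace") as f:
        out.write(text_to_html(f.read(), title=os.path.basename(path)))
```

A filter script built on this would only need to call convert_file(sys.argv[1]). The checks for external programs and the specific error messages that the regular rcl... scripts perform are left out on purpose.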

Don't hesitate to come back to me if anything is unclear. If you go the
script way and you like the results, I'd be glad to add it to the
distribution so that it will be there for you next release...

Regards,
jf

Linos writes:
 > Hello,
 > I am trying to get recoll to index my python source files, but I am doing
 > something wrong because I can't get it to work. I have added these files to
 > my ~/.recoll directory.
 >
 > mimeconf
 > [index]
 > text/x-python = internal
 >
 > [icons]
 > text/x-python = txt
 >
 > [categories]
 > text = \
 >        text/x-python
 >
 > mimemap
 > .py = text/x-python
 >
 > mimeview
 > text/x-python = kwrite %f
 >
 > In the GUI I can select the type to filter on in advanced search if I
 > want, but the files do not really get indexed, only their names; I can't
 > search the content. Obviously the viewer and icon are only there to test the
 > indexing; later I will make it use a better editor/icon. I have tried
 > recreating the complete xapiandb with recollindex -z. One other question: if
 > I add new types (once you help me with the correct way to do it, hehehe), do
 > I have to recreate the complete index if the files have not changed and are
 > in a previously indexed subdirectory?

#!/usr/bin/python
# -*- coding: iso-8859-1 -*-
"""
    MoinMoin - Python source parser and colorizer
"""

# Based on the code from Jürgen Hermann, the following changes were made:
#
# Mike Brown <http://skew.org/~mike/>:
# - make script callable as a CGI and an Apache handler for .py files.
#
# Christopher Arndt <http://chrisarndt.de>:
# - make script usable as a module
# - use class tags and style sheet instead of <style> tags
# - when called as a script, add HTML header and footer
#
# TODO:
#
# - parse script encoding and allow output in any encoding by using unicode
#   as intermediate

__version__ = '0.3'
__date__ = '2005-07-04'
__license__ = 'GPL'
__author__ = 'Jürgen Hermann, Mike Brown, Christopher Arndt'


# Imports
import cgi, string, sys, cStringIO
import keyword, token, tokenize


#############################################################################
### Python Source Parser (does Highlighting)
#############################################################################

_KEYWORD = token.NT_OFFSET + 1
_TEXT    = token.NT_OFFSET + 2

_css_classes = {
    token.NUMBER:       'number',
    token.OP:           'operator',
    token.STRING:       'string',
    tokenize.COMMENT:   'comment',
    token.NAME:         'name',
    token.ERRORTOKEN:   'error',
    _KEYWORD:           'keyword',
    _TEXT:              'text',
}

_HTML_HEADER = """\
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
  "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
  <title>%%(title)s</title>
  <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
  <meta name="Generator" content="colorize.py (version %s)">
</head>
<body>
""" % __version__

_HTML_FOOTER = """\
</body>
</html>
"""

_STYLESHEET = """\
<style type="text/css">
pre.code {
    font-family: Lucida,"Courier New";
}

.number {
    color: #0080C0;
}
.operator {
    color: #000000;
}
.string {
    color: #008000;
}
.comment {
    color: #808080;
}
.name {
    color: #000000;
}
.error {
    color: #FF8080;
    border: solid 1.5pt #FF0000;
}
.keyword {
    color: #0000FF;
    font-weight: bold;
}
.text {
    color: #000000;
}

</style>

"""

class Parser:
    """ Send colored python source.
    """

    stylesheet = _STYLESHEET

    def __init__(self, raw, out=sys.stdout):
        """ Store the source text.
        """
        self.raw = string.strip(string.expandtabs(raw))
        self.out = out

    def format(self):
        """ Parse and send the colored source.
        """
        # store line offsets in self.lines
        self.lines = [0, 0]
        pos = 0
        while 1:
            pos = string.find(self.raw, '\n', pos) + 1
            if not pos: break
            self.lines.append(pos)
        self.lines.append(len(self.raw))

        # parse the source and write it
        self.pos = 0
        text = cStringIO.StringIO(self.raw)
        self.out.write(self.stylesheet)
        self.out.write('<pre class="code">\n')
        try:
            tokenize.tokenize(text.readline, self)
        except tokenize.TokenError, ex:
            msg = ex[0]
            line = ex[1][0]
            self.out.write("<h3>ERROR: %s</h3>%s\n" % (
                msg, self.raw[self.lines[line]:]))
        self.out.write('\n</pre>')

    def __call__(self, toktype, toktext, (srow,scol), (erow,ecol), line):
        """ Token handler.
        """
        if 0:
            print "type", toktype, token.tok_name[toktype], "text", toktext,
            print "start", srow,scol, "end", erow,ecol, "<br>"

        # calculate new positions
        oldpos = self.pos
        newpos = self.lines[srow] + scol
        self.pos = newpos + len(toktext)

        # handle newlines
        if toktype in [token.NEWLINE, tokenize.NL]:
            self.out.write('\n')
            return

        # send the original whitespace, if needed
        if newpos > oldpos:
            self.out.write(self.raw[oldpos:newpos])

        # skip indenting tokens
        if toktype in [token.INDENT, token.DEDENT]:
            self.pos = newpos
            return

        # map token type to a color group
        if token.LPAR <= toktype and toktype <= token.OP:
            toktype = token.OP
        elif toktype == token.NAME and keyword.iskeyword(toktext):
            toktype = _KEYWORD
        css_class = _css_classes.get(toktype, 'text')

        # send text
        self.out.write('<span class="%s">' % (css_class,))
        self.out.write(cgi.escape(toktext))
        self.out.write('</span>')


def colorize_file(file=None, outstream=sys.stdout, standalone=True):
    """Convert a python source file into colorized HTML.

    Reads file and writes to outstream (default sys.stdout). file can be a
    filename or a file-like object (only the read method is used). If file is
    None, act as a filter and read from sys.stdin. If standalone is True
    (default), send a complete HTML document with header and footer. Otherwise
    only a stylesheet and a <pre> section are written.
    """

    from os.path import basename
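    if hasattr(file, 'read'):
        sourcefile = file
        # Look up the stream's name before clearing `file`; streams without
        # a usable name (e.g. StringIO) fall back to a placeholder title.
        try:
            filename = basename(file.name)
        except (AttributeError, TypeError):
            filename = 'STREAM'
        # Clear `file` so the caller's stream is not closed at the end.
        file = None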
    elif file is not None:
        try:
            sourcefile = open(file)
            filename = basename(file)
        except IOError:
            raise SystemExit("File %s unknown." % file)
    else:
        sourcefile = sys.stdin
        filename = 'STDIN'
    source = sourcefile.read()

    if standalone:
        outstream.write(_HTML_HEADER % {'title': filename})
    Parser(source, out=outstream).format()
    if standalone:
        outstream.write(_HTML_FOOTER)

    if file:
        sourcefile.close()

if __name__ == "__main__":
    import os
    if os.environ.get('PATH_TRANSLATED'):
        filepath = os.environ.get('PATH_TRANSLATED')
        print 'Content-Type: text/html; charset="iso-8859-1"\n'
        colorize_file(filepath)
    elif len(sys.argv) > 1:
        filepath = sys.argv[1]
        colorize_file(filepath)
    else:
        colorize_file()

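The attached script targets Python 2 (print statements, cStringIO, cgi.escape, tuple parameters). As a hedged sketch only, the same token-to-CSS-class idea could look roughly like this on Python 3. The function name colorize and the CSS_CLASSES table are made up for this example, the class names are borrowed from the stylesheet above, and this is not a port of the whole script (no HTML header, stylesheet or CGI handling).

```python
import html
import io
import keyword
import token
import tokenize

# Token type -> CSS class, reusing the class names from the script above.
CSS_CLASSES = {
    token.NUMBER: "number",
    token.OP: "operator",
    token.STRING: "string",
    tokenize.COMMENT: "comment",
    token.NAME: "name",
    token.ERRORTOKEN: "error",
}


def colorize(source):
    """Return an HTML <pre> fragment with one <span> per Python token."""
    # offsets[row] is the index in `source` where 1-based line `row` starts.
    offsets = [0, 0]
    for line in io.StringIO(source):
        offsets.append(offsets[-1] + len(line))

    out = ['<pre class="code">']
    pos = 0  # index of the first character not written yet
    try:
        for tok in tokenize.generate_tokens(io.StringIO(source).readline):
            if tok.type in (token.INDENT, token.DEDENT, token.ENDMARKER):
                continue  # indentation is written as the gap before the next token
            start = offsets[tok.start[0]] + tok.start[1]
            end = offsets[tok.end[0]] + tok.end[1]
            if start > pos:
                # original whitespace between two tokens
                out.append(html.escape(source[pos:start]))
            text = html.escape(source[start:end])
            if tok.type in (token.NEWLINE, tokenize.NL) or not text:
                out.append(text)
            else:
                if tok.type == token.NAME and keyword.iskeyword(tok.string):
                    css = "keyword"
                else:
                    css = CSS_CLASSES.get(tok.type, "text")
                out.append('<span class="%s">%s</span>' % (css, text))
            pos = max(pos, end)
    except (tokenize.TokenError, SyntaxError):
        pass  # whatever could not be tokenized is written below as plain text
    out.append(html.escape(source[pos:]))
    out.append("</pre>")
    return "".join(out)
```

Calling colorize(open(path).read()) gives a fragment that can be placed inside a page body; source that fails to tokenize is still written out, just without spans after the point of failure.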