Ticket #689 (new enhancement)

Opened 1 year ago

modifying preparse.py to benefit from the Python standard tokenize module

Reported by: was Assigned to: somebody
Priority: major Milestone: sage-wishlist
Component: basic arithmetic Keywords:
Cc:

Description

On 9/18/07, Joel B. Mohler <joel@kiwistrawberry.us> wrote:

    I don't have a strong opinion about the original proposition ... although I
    have spent the last week being thoroughly vexed by python being zero-based
    when all the things I would write on paper about math would index things
    one-based.


One solution to that is to just write things 0-based in your math papers. 
I've been doing that in all the mathematics I write for the last 2 years
or so, and nobody complains.   Consider an $n$-dimensional vector
space with basis $v_0, \ldots, v_{n-1}$. 

    However, I did want to say this about the preparser.  My impression is that it
    is written by doing very manual character/string reading.  I think
    the 'tokenize' module could probably make the code much much cleaner.  It has
    annoying (but understandable) semantics.  Here's an example below.


True.  I've been looking forward to somebody rewriting the preparse(...)
function for a long time.  Much to my surprise nobody has.  I don't think
anybody has even tried, though people have suggested they would try
on a few occasions. I would certainly be happy if it were to happen, though
I'm not going to do it myself, since the current version works perfectly
well, is easy to understand, and I have other things that I have to do.
Also, speeding up the preparser might have little noticeable effect on the
speed of Sage, since all actual work / computation happens after preparsing.
That said, if you do
    sage: search_src('sage_eval')
you'll see 83 places in the Sage library code itself that involve calling
the Sage preparser, and those could be faster. 

Using tokenize as you illustrate below would probably be a simple
intermediate way to rewrite the preparser without going nuts trying
to do something too complicated.  E.g., the preparser currently
has code to find string literals in the input line -- the tokenize module
does the same thing, probably much better.   Also the preparser
has code to parse out valid identifiers, and again the tokenize module
below does that automatically.

I think "modifying preparse.py to benefit from the Python standard tokenize module" would be a good trac enhancement ticket:
  

    --
    Joel

    import tokenize

    # helper class to generate a stream from a string
    class line_token_stream:
        def __init__( self, s ):
            self.line = s

        def __call__( self ):
            if self.line:
                s = self.line
                self.line = None
                return s
            raise StopIteration

    def tokenize_line( s ):
        for t in  tokenize.generate_tokens( line_token_stream( s ) ):
            yield t

    for i in tokenize_line( "func(1+2)" ):
        print i