Hi all, sorry for the long delay in replying, I got a bit distracted by
other things.

On 2004-03-22 13:51:14 [+0100], Oliver Tappe wrote:
> On 2004-03-21 at 18:12:16 [+0100], Christian Packmann wrote:
>
> Yes, we should include all API-words into the keywords-files, but maybe
> in a somewhat more structured fashion. If a user decides that he doesn't
> want the keywords of the API highlighted, he can just make Pe display
> them in the same color as normal text. However, currently there's no way
> the user can find out which of the user-defined keyword colors 1-4 he
> has to change. I suggest that we either come up with a convention (e.g.
> always use user-set 4 for API keywords in all keyword files) or we make
> that explicit by adding one or more user-sets that are properly named
> (e.g.: API-types, API-functions, etc.). Just using a convention is
> (much) less work, I suppose...

Extending the syntax-coloring system would be a good idea IMHO,
especially as there are more classes of keywords than can currently be
supported. I think these keyword classes should be supported (with some
C/C++ examples):

  keywords             - for, while, continue, break, inline
  types                - int, char, void, short, long
  global functions     - strlen, strcpy, fopen, fread
  constants            - EOF, SEEK_CUR, LONG_MIN, EXIT_SUCCESS
  class names          - vector, map, string
  class function names - resize, clear, push_back, find

This system should be applicable to all modern OO languages. When
extending the keyword file to include API definitions, the keywords can
be added to the appropriate sections (except the first, which only
includes definitions for the language itself).

>> I've looked at CKeywords.cpp, and I think the problem might be in
>> CompressDFA() with the use of <vector> for
>>     nxt = vector<int>(b);
>>     chk = vector<int>(b);
>> without doing a .reserve() for those, with
> Hm, as far as I know, the vector<int>(b)-part above *does* reserve b
> entries in the vector, doesn't it?
Yes, but there's a slight difference between setting the size and
reserving entries. The initial allocation (vector<int>(b)) and a call to
resize() set the size of the vector to the given number of elements and
allocate enough memory for them. A call to reserve() does not change the
element count, but allocates enough memory for the given number of
elements. If many single elements are then added with push_back(), no
additional memory allocation is required until the size given in
reserve() is exceeded. This helps performance when many elements are
added one by one, as each internal resizing of the vector implies that
the whole array contents have to be copied from the old storage to the
new; especially with big vectors this quickly gets expensive.

But this is obviously not the problem here.

> A temporarily added reserve()-call did not change the
> startup-situation noticeably, so I'm afraid that's not quite it...
> Maybe someone should start a profiler on this one?

I ripped the core routines out and transplanted them into a test
program, testing this against my keywords.cpp. Profiling showed that
most of the execution time is spent accessing <vector> elements, not in
(re)allocating vectors. It would be possible to optimize this slightly
by using plain C arrays and doing the array management manually, but I
doubt this would do much good, as computation time rises far faster
than linearly with the word count.

I tested the runtime behavior against different versions of my
keywords.cpp (on an Athlon XP/2GHz); here's a table with all relevant
parameters (DFA = Deterministic Finite Automaton):

  WC       = word count
  DFA      = number of DFA states
  DFA/WC   = DFA states per word
  sec.     = seconds to process keywords.cpp
  mSec/WC  = milliseconds per word
  mSec/DFA = milliseconds per DFA state

    WC     DFA   DFA/WC    sec.   mSec/WC   mSec/DFA
  1031    6541    6.34     4.73     4.59      0.723
  1146    7830    6.83     6.73     5.87      0.859
  1316    9532    7.24    10.54     8.01      1.106
  1660   12676    7.69    17.01    10.24      1.342
  2628   20585    7.83    47.00    17.88      2.280

As can be seen, parsing time quickly deteriorates with rising word
count. AFAICS this is due to the more complex DFA states that have to
be processed when there are many keywords; the inner loops within the
main loop are executed more often, and combined with the increasing
iteration count of the main loop this gives a super-linear increase in
computation time. A complete keywords.cpp would have about 4000-5000
words; compression would likely take 2-3 minutes on my machine, and I
don't want to think about slower machines...

Another issue is memory usage: as each kws structure used to store a
DFA state takes 260 bytes, temporary memory usage for the DFA map is
already beyond 5 MB for the last case above; for a complete
keywords.cpp it would likely exceed 10 MB. Not much of an issue
nowadays, but still not too nice.

The only way to keep the current approach would be a complete rewrite
of the whole DFA handling. If somebody who actually understands all
that code is interested... ;)

Otherwise a different approach is required. I've done a few tests with
STL maps, and I think they offer an acceptable solution. I used a
map<string,uint8> for this; the uint8 stores the keyword class a
specific entry belongs to. Looking up an entry in such a map containing
the current keywords.cpp takes about a microsecond on the Athlon 2GHz;
parsing the keyword file into a map takes about 0.013 seconds (this is
with a complete parser for six keyword classes, which also allows
comments within keyword files). I've also tested this on a
PentiumMMX/200, where the code is about 10 times slower - still pretty
okay IMO: parsing the most complex keyword files would take <0.3
seconds, and could be done during Pe startup.
IMO this would be better than having to wait for ~20 minutes after
loading a modified keyword file. ;)

Regrettably I won't have the time to implement this in Pe. If anybody
else is interested, let me know so I can send you the parser and other
required code snippets.

Bye,
Chris