Pattern matching optimization
My understanding of pattern matching and string manipulation in Lua is limited at best. I'm currently developing a library that will use a tree-structured dictionary to look up possible suggestions as the user types text into an EditBox.
The dictionary will be split into two parts: a static default dictionary with pre-defined entries, and a dynamic dictionary that uses the current text of the EditBox to add further suggestions. My question is how best to strip a lengthy piece of text of everything that isn't a word. My current approach uses a table of "forbidden" characters, and each occurrence of one is replaced with whitespace. After that, all runs of whitespace are collapsed into single spaces before the string is split into table entries. This results in a lot of repeats in the returned substrings, because syntax is rarely used just once. Can this be done more efficiently? This is what the code looks like:
Lua Code:
Note: pairsByKeys is just a custom table iterator that sorts by key. |
Not entirely sure how you're wanting to define "words". If it's just letters, you can use the 'non letter' class %A.
Code:
text:gsub("%A+", " ")
You can do the same for spaces, instead of checking for different lengths in separate steps.
Code:
text:gsub("%s+", " ")
If you're looking to extract keywords and variables though, you might also want to consider underscores and non-leading digits. |
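As a rough illustration of the suggestion above, the same letter class can drive string.gmatch directly, which skips the replace, collapse and split passes. This is only a sketch: it assumes "words" means runs of ASCII letters, and the names extractWords and seen are made up for the example. Recording each word in a set also avoids the repeated entries mentioned in the question.

```lua
-- Hypothetical helper (not from the original post):
-- returns a list of the unique letter-only words in `text`.
local function extractWords(text)
    local seen, words = {}, {}
    -- %a+ matches each maximal run of letters, so everything
    -- that is not a letter acts as a separator.
    for word in text:gmatch("%a+") do
        if not seen[word] then
            seen[word] = true
            words[#words + 1] = word
        end
    end
    return words
end

local words = extractWords("local foo = foo + bar(foo, 42)")
print(table.concat(words, " "))  -- prints: local foo bar
```

If underscores and digits should count as part of a word, the class would need to change accordingly.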
Character classes like %A are not localized, so that approach wouldn't work for non-English users.
For removing multiple spaces, I'd strongly recommend Lombra's solution, though I'd amend it to only bother performing a replacement if there's more than one space:
Code:
text = gsub(text, "%s%s+", " ")
Code:
text = gsub(text, "[\1\10\32\34#%'\(\)\*\+,\-\./%d:;<>=\[\]^{|}~]", "")
Edit: To remove only standalone numbers, use a frontier pattern:
Code:
text = gsub(text, "%f[%w]%d+%f[%W]", "") |
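To show how a frontier pattern separates standalone numbers from digits inside identifiers, here is a small sketch. It uses %f[%w] and %f[%W] as the boundaries, which is an assumption about the intended behavior, and the sample string is made up.

```lua
-- %f[%w] : position where an alphanumeric run begins
-- %f[%W] : position where an alphanumeric run ends
-- Only digit runs that fill a whole alphanumeric run are removed.
local sample = "frame4 = 16 + Item16 * 2"
local stripped = sample:gsub("%f[%w]%d+%f[%W]", "")
print(stripped)
```

The digits in frame4 and Item16 survive because they are not preceded by a boundary; the surrounding whitespace is left as it was, so a follow-up pass would still be needed to collapse it.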
Code:
text = gsub(text, "[\1\10\32\34#%'\(\)\*\+,\-\./%d:;<>=\[\]^{|}~]", "") |
Oh, oops, I've been writing too much JavaScript at work. Try escaping the characters correctly for Lua. :p
Code:
text = gsub(text, "[\1\10\32\34#%%'%(%)%*%+,%-%./%d:;<>=%[%]^{|}~]", "") |
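Since Lua patterns escape magic characters with % rather than a backslash, one way to avoid hand-escaping is to build the class from a plain list. The sketch below assumes a table of forbidden characters like the one described in the question; the names forbidden, escaped and class are illustrative only.

```lua
-- Hypothetical list of characters to strip (not the original table).
local forbidden = { "(", ")", "[", "]", "%", "-", "\\" }

local escaped = {}
for i, ch in ipairs(forbidden) do
    -- Prefix every punctuation character with %, so it is taken literally.
    -- The extra parentheses keep only gsub's first return value.
    escaped[i] = (ch:gsub("%p", "%%%0"))
end

local class = "[" .. table.concat(escaped) .. "]"
local cleaned = ("a(b)-c%d\\e"):gsub(class, " ")
print(class)
print(cleaned)
```

Escaping with % is safe for any punctuation character, whether or not it is actually magic in a pattern.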
Quote:
Code:
text = text:gsub("[\1\10\32\34%\\#%%%'%(%)%*%+,%-%./%d:;<>=%[%]^{|}~]", space)
One thing missing from the pattern you supplied is an escaped backslash! One last thing: omitting single letters. At this point, using these three:
Code:
text = text:gsub("[\1\10\32\34\92#%%%'%(%)%*%+,%-%./%d:;<>=%[%]^{|}~]", space) |
Code:
text = text:gsub("[\1\10\32\34\92#%%%'%(%)%*%+,%-%./%d:;<>=%[%]^{|}~]", space)
Lua Code:
|
Quote:
The way Lua handles strings, there's no need to store it in a constant. It's actually using up more resources to do so.
Quote:
Just use a literal string in the gsub() call as noted previously. |
It's possible I'm misunderstanding what the goal is here, but why aren't you just matching the word pattern you're searching for?
e.g. Lua Code:
Code:
function |
Quote:
[_%a][_%w]+ says to match an underscore or a letter, followed by at least one underscore, letter, or digit. Without defined word boundaries this approach isn't perfect: if your source text contains something like 123InvalidVariable-Name, it will match "InvalidVariable". Lua patterns don't support lookahead or lookbehind, but you may be able to surround your pattern with a character class (whitespace, parentheses, commas, etc.) that must come before and after a word for it to be considered valid. Here's a potential pattern you can try, based on Phanx's character class from earlier:
Lua Code:
|
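The boundary problem described above can also be approached with frontier patterns instead of a surrounding character class. That is a different technique from the one the post proposes, shown here only as a sketch; the sample string and the table names plain and bounded are made up.

```lua
local source = "123InvalidVariable-Name = valid_name1 + _other"

-- Without boundaries: also picks up the tail of a malformed token.
local plain = {}
for word in source:gmatch("[_%a][_%w]+") do
    plain[#plain + 1] = word
end

-- With frontiers: a match must start where a run of [%w_] begins
-- and end where that run ends.
local bounded = {}
for word in source:gmatch("%f[%w_][_%a][_%w]*%f[^%w_]") do
    bounded[#bounded + 1] = word
end

print(table.concat(plain, ","))    -- InvalidVariable,Name,valid_name1,_other
print(table.concat(bounded, ","))  -- Name,valid_name1,_other
```

Note that "Name" is still matched in both cases, because the hyphen before it counts as a boundary.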
How about something like this?
Code:
for word in string.gmatch(" "..code, "[ ;]+([_%a][_%w]*)") do
    -- add word to the suggestions here
end |
I can see how in the interest of writing code, it would seem appropriate to filter out words attached to malformed snippets, but really, this library should aim to be as generally applicable as possible. I will port it for use with other stuff, like a chat addon or taking notes on screen. Might even be useful for my controller addon, considering I haven't built chat functionality for it yet. Besides, my editor debugs code automatically and will throw an error when the user has invalid variable names.
Let me ask this instead, as a programmer using an IDE, would you rather it store "ContainerFrame" and "Item" in suggestions, or "ContainerFrame4Item16", if that was what you originally entered? I can't make up my mind on that point, considering there are some cases where you want to consistently use the same numbered variable and some cases where you want to change the numbers. |
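To make the trade-off concrete, here is a small sketch of the two tokenizations being weighed. The variable names are illustrative, and neither pattern is claimed to be what the library actually uses.

```lua
local input = "ContainerFrame4Item16"

-- Option 1: keep the identifier whole, digits included.
local whole = {}
for id in input:gmatch("[_%a][_%w]*") do
    whole[#whole + 1] = id
end

-- Option 2: keep only the letter runs, dropping the numbers.
local parts = {}
for part in input:gmatch("%a+") do
    parts[#parts + 1] = part
end

print(table.concat(whole, ", "))  -- ContainerFrame4Item16
print(table.concat(parts, ", "))  -- ContainerFrame, Item
```

Nothing prevents storing the results of both passes, at the cost of a larger dictionary.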
Normally, we only need the keywords and identifiers for auto-completion.
I see you are using kristofer's Indent.lua. I think you can record those words while the lib scans for identifiers (when tokentype == tokens.TOKEN_IDENTIFIER, record it). A binary search can quickly check whether the identifier already exists, or you can just keep a cache table for checking. Making an editor is fun, but with WoW, it's a pain in the ass. |
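A binary search over a sorted identifier list might look roughly like the sketch below. The function names findIdentifier and addIdentifier are hypothetical and not part of Indent.lua; the ordering relies on Lua's default string comparison.

```lua
-- Returns found, index. When not found, index is where `word`
-- would have to be inserted to keep `list` sorted.
local function findIdentifier(list, word)
    local lo, hi = 1, #list
    while lo <= hi do
        local mid = math.floor((lo + hi) / 2)
        if list[mid] == word then
            return true, mid
        elseif list[mid] < word then
            lo = mid + 1
        else
            hi = mid - 1
        end
    end
    return false, lo
end

-- Inserts `word` only if it is not already present.
local function addIdentifier(list, word)
    local found, index = findIdentifier(list, word)
    if not found then
        table.insert(list, index, word)
    end
end

local identifiers = {}
for _, w in ipairs({ "print", "frame", "item", "frame", "pairs" }) do
    addIdentifier(identifiers, w)
end
print(table.concat(identifiers, " "))  -- frame item pairs print
```

The cache-table alternative mentioned above would replace the search with a lookup such as seen[word], at the cost of keeping a second table alongside the sorted list.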
Well, I don't use Indent.lua; my code can be found in CodeEditor.
I don't use a tree table to store the keywords. Only one sorted table is used: self.AutoCompleteList contains all identifiers (self is the editor). I give each matching word a weight based on the input word, to make sure the first word is the most wanted one. To create a list based on the input word, the core func in the file is
The filtering is done in this code:
Lua Code:
Maybe it can help you. |
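Since the original code is not shown, the following is only a guess at what weighting matches against the input word could look like. The scoring rule and the names weigh and suggest are hypothetical and are not taken from CodeEditor.

```lua
-- Hypothetical scoring: a lower weight sorts first.
local function weigh(candidate, input)
    if candidate:sub(1, #input) == input then
        -- Exact-case prefix: fewer extra letters ranks higher.
        return #candidate - #input
    elseif candidate:lower():sub(1, #input) == input:lower() then
        -- Case-insensitive prefix ranks after all exact-case matches.
        return 100 + #candidate - #input
    end
    return nil  -- not a match
end

local function suggest(list, input)
    local matches = {}
    for _, candidate in ipairs(list) do
        local weight = weigh(candidate, input)
        if weight then
            matches[#matches + 1] = { word = candidate, weight = weight }
        end
    end
    table.sort(matches, function(a, b)
        if a.weight ~= b.weight then
            return a.weight < b.weight
        end
        return a.word < b.word
    end)
    local words = {}
    for i, m in ipairs(matches) do
        words[i] = m.word
    end
    return words
end

local result = suggest({ "frame", "Frame", "format", "framework", "print" }, "fr")
print(table.concat(result, " "))  -- frame framework Frame
```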