WoWInterface - Pattern matching optimization

WoWInterface (https://www.wowinterface.com/forums/index.php)

- Lua/XML Help (https://www.wowinterface.com/forums/forumdisplay.php?f=16)

- - Pattern matching optimization (https://www.wowinterface.com/forums/showthread.php?t=52746)

MunkDev

09-17-15 09:21 PM

Pattern matching optimization

My understanding of pattern matching and string manipulation in lua is limited at best. I'm currently developing a library that will use a tree structured dictionary to look up possible suggestions as the user inputs text into an EditBox.

The dictionary will be split into two parts; one static, default dictionary with pre-defined entries and one dynamic dictionary which uses the current text of the EditBox to further add suggestions.

My question is regarding how I can strip a lengthy piece of text of all characters that are not words in the best way. My current approach is using a table of "forbidden" characters which is then used to replace each occurrence with whitespace. After that, all multiple occurences of whitespace are replaced by single whitespaces before splitting the string into table entries. This results in a lot of repeats in the returned substrings because syntax is rarely used just once. Can this be done more efficiently?

This is what the code looks like:

Lua Code:

-- Byte table with forbidden characters
local splitByte = {
     [1]   = true, -- no idea
     [10]  = true, -- newline
     [32]  = true, -- space
     [34]  = true, -- ""
     [35]  = true, -- #
     [37]  = true, -- %
     [39]  = true, -- '
     [40]  = true, -- (
     [41]  = true, -- )
     [42]  = true, -- *
     [43]  = true, -- +
     [44]  = true, -- ,
     [45]  = true, -- -
     [46]  = true, -- .
     [47]  = true, -- /
     [48]  = true, -- 0
     [49]  = true, -- 1
     [50]  = true, -- 2
     [51]  = true, -- 3
     [52]  = true, -- 4
     [53]  = true, -- 5
     [54]  = true, -- 6
     [55]  = true, -- 7
     [56]  = true, -- 8
     [57]  = true, -- 9
     [58]  = true, -- :
     [59]  = true, -- ;
     [60]  = true, -- <
     [62]  = true, -- >
     [61]  = true, -- =
     [91]  = true, -- [
     [93]  = true, -- ]
     [94]  = true, -- ^
     [123] = true, -- {
     [124] = true, -- |
     [125] = true, -- }
     [126] = true, -- ~
}
 
local n = CodeMonkeyNotepad -- the editbox 
local text = n:GetText() -- get the full text string
local space = strchar(32)
 
-- Replace with space
for k, v in pairsByKeys(splitByte) do
     -- treat numbers differently
     if k < 48 or k > 57 then
          text = text:gsub("%"..strchar(k), space)
     else
          text = text:gsub(strchar(k), space)
     end
end
 
-- Remove multiple spaces
for i=10, 2, -1 do
     text = text:gsub(strrep(space, i), space)
end
 
-- Collect words in table
local words = {}
for k, v in pairsByKeys({strsplit(space, text)}) do
     -- ignore single letters
     if v:len() > 1 then
          words[v] = true
     end
end

Here's an example output, using the actual code as text inside the EditBox:

Note: pairsByKeys is just a custom table iterator that sorts by key.

Lombra

09-18-15 07:13 AM

Not entirely sure how you're wanting to define "words". If it's just letters, you can use the 'non letter' class %A.

Code:

text:gsub("%A+", " ")

+ matches occurences of one or more continguous characters.

You can do the same for spaces, instead of checking for different lengths in separate steps.

Code:

text:gsub("%s+", " ")

%s matches whitespace.

If you're looking to extract keywords and variables though, you might want to also consider underscores and non leading digits.

MunkDev

09-18-15 07:56 AM

Quote:

Originally Posted by Lombra (Post 311037)

Not entirely sure how you're wanting to define "words". If it's just letters, you can use the 'non letter' class %A.

If you're looking to extract keywords and variables though, you might want to also consider underscores and non leading digits.

So let's say I want to omit standalone numbers, but keep stuff like "ContainerFrame1Item1" without it being split into "ContainerFrame" and "Item", how would I go about doing that, considering "%A+" will just get rid of all those numbers?

Phanx

09-18-15 08:00 AM

Character classes like %A are not localized, so that approach wouldn't work for non-English users.

For removing multiple spaces, I'd strongly recommend Lombra's solution, though I'd amend it to only bother performing a replacement if there's more than one space:

Code:

text = gsub(text, "%s%s+", " ")

For removing "forbidden" characters, rather than using a table and a bunch of gsub and strchar operations, just use a character class and a single operation:

Code:

text = gsub(text, "[\1\10\32\34#%'\*\+,\-\./%d:;<>=\[\]^{|}~]", "")

If you still need the table for other purposes, you can build the character class at load-time by iterating over the table, but don't forget to escape "magic characters".

Edit: To remove only standalone numbers, use a frontier pattern:

Code:

text = gsub(text, "%f[%a%d]%d+%f[%A%D]", "")

MunkDev

09-18-15 08:14 AM

Code:

text = gsub(text, "[\1\10\32\34#%'\*\+,\-\./%d:;<>=\[\]^{|}~]", "")

This doesn't remove any of the characters supplied in the character class.

Phanx

09-18-15 08:19 AM

Oh, oops, I've been writing too much JavaScript at work. Try escaping the characters correctly for Lua. :p

Code:

text = gsub(text, "[\1\10\32\34#%%'%(%)%*%+,%-%./%d:;<>=%[%]^{|}~]", "")

MunkDev

09-18-15 08:25 AM

Quote:

Originally Posted by Phanx (Post 311047)

Oh, oops, I've been writing too much JavaScript at work. Try escaping the characters correctly for Lua. :p

Code:

text = gsub(text, "[\1\10\32\34#%%'%(%)%*%+,%-%./%d:;<>=%[%]^{|}~]", "")

Yeah, I just edited the pattern to this:

Code:

text = text:gsub("[\1\10\32\34%\\#%%%'%(%)%*%+,%-%./%d:;<>=%[%]^{|}~]", space)

... and it seemed to do the trick. :p
One thing missing in the pattern you supplied, is escaping backslash!

One last thing, omitting single letters. At this point, using these three:

Code:

text = text:gsub("[\1\10\32\34\92#%%%'%(%)%*%+,%-%./%d:;<>=%[%]^{|}~]", space)

text = text:gsub("%s%s+", space)

text = text:gsub("%f[%a%d]%d+%f[%A%D]", "")

... there will be only words left, apart from single letter variables. They are pretty useless in a dictionary.

MunkDev

09-18-15 10:17 AM

Code:

text = text:gsub("[\1\10\32\34\92#%%%'%(%)%*%+,%-%./%d:;<>=%[%]^{|}~]", space)

text = text:gsub("%s%a%s", space)

text = text:gsub("%s%a%s", space)

text = text:gsub("%s%s+", space)

This works, but repeats the operation twice to get rid of the cases where two single letters are separated by only one whitespace, such as..

Lua Code:

-- Original code
function someFunction(t,f) print(t,f) end
 
-- Step 1
function someFunction t f print t f end
 
-- Step 2
function someFunction f print f end
 
-- Step 3
function someFunction  print  end
 
-- End result
function someFunction print end

SDPhantom

09-18-15 01:20 PM

Quote:

Originally Posted by MunkDev (Post 311049)

Code:

text = text:gsub("[\1\10\32\34\92#%%%'%(%)%*%+,%-%./%d:;<>=%[%]^{|}~]", space)

You might want to remove \32 because it ends up replacing spaces with itself.
The way Lua handles strings, there's no need to store it in a constant. It's actually using up more resources to do so.

Quote:

Originally Posted by MunkDev (Post 311030)

Code:

local space = strchar(32)

You should never use this function with a static number.
Just use a literal string in the gsub() call as noted previously.

semlar

09-18-15 02:04 PM

It's possible I'm misunderstanding what the goal is here, but why aren't you just matching the word pattern you're searching for?

eg.

Lua Code:

for word in string.gmatch('function someFunction(t,f) print(t,f) end', '[_%a][_%w]+') do print(word) end

Code:

function

someFunction

print

end

MunkDev

09-19-15 02:24 AM

Quote:

Originally Posted by semlar (Post 311054)

It's possible I'm misunderstanding what the goal is here, but why aren't you just matching the word pattern you're searching for?

eg.

Lua Code:

for word in string.gmatch('function someFunction(t,f) print(t,f) end', '[_%a][_%w]+') do print(word) end

This is a very simple and effective approach. Could you please break down how the pattern works?

semlar

09-19-15 01:02 PM

Quote:

Originally Posted by MunkDev (Post 311064)

This is a very simple and effective approach. Could you please break down how the pattern works?

%a represents all letters, %w represents all letters and numbers.

[_%a][_%w]+ says to match an underscore or a letter, followed by at least one underscore or letter or number.

Without defining word boundaries this approach isn't perfect, if your source text contains something like 123InvalidVariable-Name, it will match "InvalidVariable".

Lua doesn't support lookahead/behind, but you may be able to surround your pattern with a character class including whitespace, parentheses, commas, etc. which must come before and after your word to be considered valid.

Here's a potential pattern you can try based on phanx's character class earlier..

Lua Code:

for word in string.gmatch([[function
someFunction(t,f) print(to,fr) _ContainerFrame1_1 end]], '%f[^\0%s\'\"\\/#%%()*+,.:;<>=%[%]{|}~]([_%a][_%w]+)%f[\0%s\'\"\\/#%%()*+,.:;<>=%[%]{|}~]') do print(word) end

SDPhantom

09-20-15 05:48 PM

How about something like this?

Code:

for word in string.gmatch(" "..code,"[ ;]+([_%a][_%w]*) do

This allows single character names by making the second character optional (* defines 0 or more, tries to match as many as possible). This also injects a delimiter before the string to make the matching work correctly. There are also problems that could happen depending on where spaces are put that is valid Lua code, but would mess up this tokenizer.

MunkDev

09-20-15 06:18 PM

I can see how in the interest of writing code, it would seem appropriate to filter out words attached to malformed snippets, but really, this library should aim to be as generally applicable as possible. I will port it for use with other stuff, like a chat addon or taking notes on screen. Might even be useful for my controller addon, considering I haven't built chat functionality for it yet. Besides, my editor debugs code automatically and will throw an error when the user has invalid variable names.

Let me ask this instead, as a programmer using an IDE, would you rather it store "ContainerFrame" and "Item" in suggestions, or "ContainerFrame4Item16", if that was what you originally entered? I can't make up my mind on that point, considering there are some cases where you want to consistently use the same numbered variable and some cases where you want to change the numbers.

semlar

09-20-15 08:19 PM

Quote:

Originally Posted by MunkDev (Post 311098)

Let me ask this instead, as a programmer using an IDE, would you rather it store "ContainerFrame" and "Item" in suggestions, or "ContainerFrame4Item16", if that was what you originally entered?

I think it should only be suggesting full variable names that actually exist within the current document, a partial match would only be confusing.

kurapica.igas

09-20-15 10:07 PM

1 Attachment(s)

Normally, we only need the keywords and identifiers for auto-completation.

I see you are using the kristofer's Indent.lua. I think you can record those words when the lib scan for identifiers(tokentype == tokens.TOKEN_IDENTIFIER, record it) .

Using a dichotomy search can quickly check if the identifier existed or not, or just keep a cache table for checking.

Making an editor is fun, but with wow, it's a pain in the ass.

Attachment 8650

MunkDev

09-23-15 08:34 AM

Quote:

Originally Posted by kurapica.igas (Post 311100)

Do you have a code snippet for that? I was thinking of thinning out the table by matching each letter, which will of course remove unwanted results with each iteration, but it's by no means a binary algorithm.

kurapica.igas

09-23-15 09:43 AM

Well, I don't use the Indent.lua, my code can be found in CodeEditor

I don't use a tree table to store the keywords, only one sorted table is used, like self.AutoCompleteList contains all identifiers(self is the editor).

I give each match words a weight, that based on the input word, to make sure the first word is the most wanted word.

To create a list based on the input word, the core func in the file is

Compare -- Used to compare words without case.
CompareWeight -- Used to compare two word's weight that based on the input word.
GetIndex -- Use dichotomy query to find the index of the word
TransMatchWord -- Create a match pattern based on the input word, 'pt' -> '^([pP])([%w_]-)([tT])([%w_]-)$', this would be used on match words, so weight and color can be applied to those words(check the pic).
ApplyColor -- Apply color to the match word.
ApplyAutoComplete -- The main func.

The filter job is done in those code :

Lua Code:

local uword = "^" .. word:gsub("[%w_]", TransMatchWord) .. "$"  -- the pattern
local header = word:sub(1, 1)       -- the first char of the match word must be the same to the input word
 
local lst = self.AutoCompleteList    -- the sorted identifiers list based on the Compare func
local sIdx = GetIndex(lst, header) -- get start index based on the first char, where we start checking
 
if sIdx == 0 then sIdx = 1 end
 
for i = sIdx, #lst do
    local value = lst[i]
    if #value == 0 or Compare(header, value:sub(1, 1)) then break end  -- no need to continue
 
    _AutoCheckWord = value  -- Keep word to be used in ApplyColor
 
    if _AutoCheckWord:match(uword) then
        _AutoCheckWord:gsub(uword, ApplyColor)
 
        tinsert(_AutoCacheKeys, _AutoCheckWord)
    end
end
 
-- ...
Array.Sort(_AutoCacheKeys, CompareWeight)  -- Sort match words based on their weight

May it can help you.

All times are GMT -6. The time now is 08:30 AM.