WoWInterface

WoWInterface (https://www.wowinterface.com/forums/index.php)
-   Lua/XML Help (https://www.wowinterface.com/forums/forumdisplay.php?f=16)
-   -   Pattern matching optimization (https://www.wowinterface.com/forums/showthread.php?t=52746)

MunkDev 09-17-15 09:21 PM

Pattern matching optimization
 
My understanding of pattern matching and string manipulation in lua is limited at best. I'm currently developing a library that will use a tree structured dictionary to look up possible suggestions as the user inputs text into an EditBox.

The dictionary will be split into two parts; one static, default dictionary with pre-defined entries and one dynamic dictionary which uses the current text of the EditBox to further add suggestions.

My question is regarding how I can strip a lengthy piece of text of all characters that are not words in the best way. My current approach is using a table of "forbidden" characters which is then used to replace each occurrence with whitespace. After that, all multiple occurences of whitespace are replaced by single whitespaces before splitting the string into table entries. This results in a lot of repeats in the returned substrings because syntax is rarely used just once. Can this be done more efficiently?

This is what the code looks like:
Lua Code:
  1. -- Byte table with forbidden characters
  2. local splitByte = {
  3.      [1]   = true, -- no idea
  4.      [10]  = true, -- newline
  5.      [32]  = true, -- space
  6.      [34]  = true, -- ""
  7.      [35]  = true, -- #
  8.      [37]  = true, -- %
  9.      [39]  = true, -- '
  10.      [40]  = true, -- (
  11.      [41]  = true, -- )
  12.      [42]  = true, -- *
  13.      [43]  = true, -- +
  14.      [44]  = true, -- ,
  15.      [45]  = true, -- -
  16.      [46]  = true, -- .
  17.      [47]  = true, -- /
  18.      [48]  = true, -- 0
  19.      [49]  = true, -- 1
  20.      [50]  = true, -- 2
  21.      [51]  = true, -- 3
  22.      [52]  = true, -- 4
  23.      [53]  = true, -- 5
  24.      [54]  = true, -- 6
  25.      [55]  = true, -- 7
  26.      [56]  = true, -- 8
  27.      [57]  = true, -- 9
  28.      [58]  = true, -- :
  29.      [59]  = true, -- ;
  30.      [60]  = true, -- <
  31.      [62]  = true, -- >
  32.      [61]  = true, -- =
  33.      [91]  = true, -- [
  34.      [93]  = true, -- ]
  35.      [94]  = true, -- ^
  36.      [123] = true, -- {
  37.      [124] = true, -- |
  38.      [125] = true, -- }
  39.      [126] = true, -- ~
  40. }
  41.  
  42. local n = CodeMonkeyNotepad -- the editbox
  43. local text = n:GetText() -- get the full text string
  44. local space = strchar(32)
  45.  
  46. -- Replace with space
  47. for k, v in pairsByKeys(splitByte) do
  48.      -- treat numbers differently
  49.      if k < 48 or k > 57 then
  50.           text = text:gsub("%"..strchar(k), space)
  51.      else
  52.           text = text:gsub(strchar(k), space)
  53.      end
  54. end
  55.  
  56. -- Remove multiple spaces
  57. for i=10, 2, -1 do
  58.      text = text:gsub(strrep(space, i), space)
  59. end
  60.  
  61. -- Collect words in table
  62. local words = {}
  63. for k, v in pairsByKeys({strsplit(space, text)}) do
  64.      -- ignore single letters
  65.      if v:len() > 1 then
  66.           words[v] = true
  67.      end
  68. end
Here's an example output, using the actual code as text inside the EditBox:

Note: pairsByKeys is just a custom table iterator that sorts by key.

Lombra 09-18-15 07:13 AM

Not entirely sure how you're wanting to define "words". If it's just letters, you can use the 'non letter' class %A.

Code:

text:gsub("%A+", " ")
+ matches occurences of one or more continguous characters.

You can do the same for spaces, instead of checking for different lengths in separate steps.
Code:

text:gsub("%s+", " ")
%s matches whitespace.

If you're looking to extract keywords and variables though, you might want to also consider underscores and non leading digits.

MunkDev 09-18-15 07:56 AM

Quote:

Originally Posted by Lombra (Post 311037)
Not entirely sure how you're wanting to define "words". If it's just letters, you can use the 'non letter' class %A.

If you're looking to extract keywords and variables though, you might want to also consider underscores and non leading digits.

So let's say I want to omit standalone numbers, but keep stuff like "ContainerFrame1Item1" without it being split into "ContainerFrame" and "Item", how would I go about doing that, considering "%A+" will just get rid of all those numbers?

Phanx 09-18-15 08:00 AM

Character classes like %A are not localized, so that approach wouldn't work for non-English users.

For removing multiple spaces, I'd strongly recommend Lombra's solution, though I'd amend it to only bother performing a replacement if there's more than one space:
Code:

text = gsub(text, "%s%s+", " ")
For removing "forbidden" characters, rather than using a table and a bunch of gsub and strchar operations, just use a character class and a single operation:
Code:

text = gsub(text, "[\1\10\32\34#%'\(\)\*\+,\-\./%d:;<>=\[\]^{|}~]", "")
If you still need the table for other purposes, you can build the character class at load-time by iterating over the table, but don't forget to escape "magic characters".

Edit: To remove only standalone numbers, use a frontier pattern:
Code:

text = gsub(text, "%f[%a%d]%d+%f[%A%D]", "")

MunkDev 09-18-15 08:14 AM

Code:

text = gsub(text, "[\1\10\32\34#%'\(\)\*\+,\-\./%d:;<>=\[\]^{|}~]", "")
This doesn't remove any of the characters supplied in the character class.

Phanx 09-18-15 08:19 AM

Oh, oops, I've been writing too much JavaScript at work. Try escaping the characters correctly for Lua. :p

Code:

text = gsub(text, "[\1\10\32\34#%%'%(%)%*%+,%-%./%d:;<>=%[%]^{|}~]", "")

MunkDev 09-18-15 08:25 AM

Quote:

Originally Posted by Phanx (Post 311047)
Oh, oops, I've been writing too much JavaScript at work. Try escaping the characters correctly for Lua. :p

Code:

text = gsub(text, "[\1\10\32\34#%%'%(%)%*%+,%-%./%d:;<>=%[%]^{|}~]", "")

Yeah, I just edited the pattern to this:
Code:

text = text:gsub("[\1\10\32\34%\\#%%%'%(%)%*%+,%-%./%d:;<>=%[%]^{|}~]", space)
... and it seemed to do the trick. :p
One thing missing in the pattern you supplied, is escaping backslash!

One last thing, omitting single letters. At this point, using these three:
Code:

text = text:gsub("[\1\10\32\34\92#%%%'%(%)%*%+,%-%./%d:;<>=%[%]^{|}~]", space)
text = text:gsub("%s%s+", space)
text = text:gsub("%f[%a%d]%d+%f[%A%D]", "")

... there will be only words left, apart from single letter variables. They are pretty useless in a dictionary.

MunkDev 09-18-15 10:17 AM

Code:

text = text:gsub("[\1\10\32\34\92#%%%'%(%)%*%+,%-%./%d:;<>=%[%]^{|}~]", space)
text = text:gsub("%s%a%s", space)
text = text:gsub("%s%a%s", space)
text = text:gsub("%s%s+", space)

This works, but repeats the operation twice to get rid of the cases where two single letters are separated by only one whitespace, such as..
Lua Code:
  1. -- Original code
  2. function someFunction(t,f) print(t,f) end
  3.  
  4. -- Step 1
  5. function someFunction t f print t f end
  6.  
  7. -- Step 2
  8. function someFunction f print f end
  9.  
  10. -- Step 3
  11. function someFunction  print  end
  12.  
  13. -- End result
  14. function someFunction print end

SDPhantom 09-18-15 01:20 PM

Quote:

Originally Posted by MunkDev (Post 311049)
Code:

text = text:gsub("[\1\10\32\34\92#%%%'%(%)%*%+,%-%./%d:;<>=%[%]^{|}~]", space)

You might want to remove \32 because it ends up replacing spaces with itself.
The way Lua handles strings, there's no need to store it in a constant. It's actually using up more resources to do so.



Quote:

Originally Posted by MunkDev (Post 311030)
Code:

local space = strchar(32)

You should never use this function with a static number.
Just use a literal string in the gsub() call as noted previously.

semlar 09-18-15 02:04 PM

It's possible I'm misunderstanding what the goal is here, but why aren't you just matching the word pattern you're searching for?

eg.
Lua Code:
  1. for word in string.gmatch('function someFunction(t,f) print(t,f) end', '[_%a][_%w]+') do print(word) end

Code:

function
someFunction
print
end


MunkDev 09-19-15 02:24 AM

Quote:

Originally Posted by semlar (Post 311054)
It's possible I'm misunderstanding what the goal is here, but why aren't you just matching the word pattern you're searching for?

eg.
Lua Code:
  1. for word in string.gmatch('function someFunction(t,f) print(t,f) end', '[_%a][_%w]+') do print(word) end

This is a very simple and effective approach. Could you please break down how the pattern works?

semlar 09-19-15 01:02 PM

Quote:

Originally Posted by MunkDev (Post 311064)
This is a very simple and effective approach. Could you please break down how the pattern works?

%a represents all letters, %w represents all letters and numbers.

[_%a][_%w]+ says to match an underscore or a letter, followed by at least one underscore or letter or number.

Without defining word boundaries this approach isn't perfect, if your source text contains something like 123InvalidVariable-Name, it will match "InvalidVariable".

Lua doesn't support lookahead/behind, but you may be able to surround your pattern with a character class including whitespace, parentheses, commas, etc. which must come before and after your word to be considered valid.

Here's a potential pattern you can try based on phanx's character class earlier..
Lua Code:
  1. for word in string.gmatch([[function
  2. someFunction(t,f) print(to,fr) _ContainerFrame1_1 end]], '%f[^\0%s\'\"\\/#%%()*+,.:;<>=%[%]{|}~]([_%a][_%w]+)%f[\0%s\'\"\\/#%%()*+,.:;<>=%[%]{|}~]') do print(word) end

SDPhantom 09-20-15 05:48 PM

How about something like this?

Code:

for word in string.gmatch(" "..code,"[ ;]+([_%a][_%w]*) do
This allows single character names by making the second character optional (* defines 0 or more, tries to match as many as possible). This also injects a delimiter before the string to make the matching work correctly. There are also problems that could happen depending on where spaces are put that is valid Lua code, but would mess up this tokenizer.

MunkDev 09-20-15 06:18 PM

I can see how in the interest of writing code, it would seem appropriate to filter out words attached to malformed snippets, but really, this library should aim to be as generally applicable as possible. I will port it for use with other stuff, like a chat addon or taking notes on screen. Might even be useful for my controller addon, considering I haven't built chat functionality for it yet. Besides, my editor debugs code automatically and will throw an error when the user has invalid variable names.



Let me ask this instead, as a programmer using an IDE, would you rather it store "ContainerFrame" and "Item" in suggestions, or "ContainerFrame4Item16", if that was what you originally entered? I can't make up my mind on that point, considering there are some cases where you want to consistently use the same numbered variable and some cases where you want to change the numbers.

semlar 09-20-15 08:19 PM

Quote:

Originally Posted by MunkDev (Post 311098)
Let me ask this instead, as a programmer using an IDE, would you rather it store "ContainerFrame" and "Item" in suggestions, or "ContainerFrame4Item16", if that was what you originally entered?

I think it should only be suggesting full variable names that actually exist within the current document, a partial match would only be confusing.

kurapica.igas 09-20-15 10:07 PM

1 Attachment(s)
Normally, we only need the keywords and identifiers for auto-completation.

I see you are using the kristofer's Indent.lua. I think you can record those words when the lib scan for identifiers(tokentype == tokens.TOKEN_IDENTIFIER, record it) .

Using a dichotomy search can quickly check if the identifier existed or not, or just keep a cache table for checking.

Making an editor is fun, but with wow, it's a pain in the ass.

Attachment 8650

MunkDev 09-23-15 08:34 AM

Quote:

Originally Posted by kurapica.igas (Post 311100)
Normally, we only need the keywords and identifiers for auto-completation.

I see you are using the kristofer's Indent.lua. I think you can record those words when the lib scan for identifiers(tokentype == tokens.TOKEN_IDENTIFIER, record it) .

Using a dichotomy search can quickly check if the identifier existed or not, or just keep a cache table for checking.

Making an editor is fun, but with wow, it's a pain in the ass.

Attachment 8650

Do you have a code snippet for that? I was thinking of thinning out the table by matching each letter, which will of course remove unwanted results with each iteration, but it's by no means a binary algorithm.

kurapica.igas 09-23-15 09:43 AM

Well, I don't use the Indent.lua, my code can be found in CodeEditor

I don't use a tree table to store the keywords, only one sorted table is used, like self.AutoCompleteList contains all identifiers(self is the editor).

I give each match words a weight, that based on the input word, to make sure the first word is the most wanted word.

To create a list based on the input word, the core func in the file is
  • Compare -- Used to compare words without case.
  • CompareWeight -- Used to compare two word's weight that based on the input word.
  • GetIndex -- Use dichotomy query to find the index of the word
  • TransMatchWord -- Create a match pattern based on the input word, 'pt' -> '^([pP])([%w_]-)([tT])([%w_]-)$', this would be used on match words, so weight and color can be applied to those words(check the pic).
  • ApplyColor -- Apply color to the match word.
  • ApplyAutoComplete -- The main func.

The filter job is done in those code :

Lua Code:
  1. local uword = "^" .. word:gsub("[%w_]", TransMatchWord) .. "$"  -- the pattern
  2. local header = word:sub(1, 1)       -- the first char of the match word must be the same to the input word
  3.  
  4. local lst = self.AutoCompleteList    -- the sorted identifiers list based on the Compare func
  5. local sIdx = GetIndex(lst, header) -- get start index based on the first char, where we start checking
  6.  
  7. if sIdx == 0 then sIdx = 1 end
  8.  
  9. for i = sIdx, #lst do
  10.     local value = lst[i]
  11.     if #value == 0 or Compare(header, value:sub(1, 1)) then break end  -- no need to continue
  12.  
  13.     _AutoCheckWord = value  -- Keep word to be used in ApplyColor
  14.  
  15.     if _AutoCheckWord:match(uword) then
  16.         _AutoCheckWord:gsub(uword, ApplyColor)
  17.  
  18.         tinsert(_AutoCacheKeys, _AutoCheckWord)
  19.     end
  20. end
  21.  
  22. -- ...
  23. Array.Sort(_AutoCacheKeys, CompareWeight)  -- Sort match words based on their weight

May it can help you.


All times are GMT -6. The time now is 08:30 AM.

vBulletin © 2024, Jelsoft Enterprises Ltd
© 2004 - 2022 MMOUI