09-17-15, 09:21 PM   #1
MunkDev
Pattern matching optimization

My understanding of pattern matching and string manipulation in Lua is limited at best. I'm currently developing a library that will use a tree-structured dictionary to look up possible suggestions as the user types text into an EditBox.

The dictionary will be split into two parts: a static default dictionary with pre-defined entries, and a dynamic dictionary that uses the current text of the EditBox to add further suggestions.
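As a rough illustration of the idea (not the library's actual code), a tree-structured dictionary could be a plain character trie, with one tree for the static entries and one for the dynamic ones. Every name below (newNode, insert, collect, suggest) is hypothetical and only serves the sketch.

```lua
-- Hypothetical sketch of a character trie; none of these names come from
-- the actual library.
local function newNode()
    return { children = {}, isWord = false }
end

-- Add a word to the trie, creating one node per character as needed.
local function insert(root, word)
    local node = root
    for i = 1, #word do
        local c = word:sub(i, i)
        local child = node.children[c]
        if not child then
            child = newNode()
            node.children[c] = child
        end
        node = child
    end
    node.isWord = true
end

-- Depth-first walk that appends every complete word below `node`.
local function collect(node, prefix, results)
    if node.isWord then
        results[#results + 1] = prefix
    end
    for c, child in pairs(node.children) do
        collect(child, prefix .. c, results)
    end
end

-- Return a sorted list of all stored words that start with `prefix`.
local function suggest(root, prefix)
    local node = root
    for i = 1, #prefix do
        node = node.children[prefix:sub(i, i)]
        if not node then
            return {}
        end
    end
    local results = {}
    collect(node, prefix, results)
    table.sort(results)
    return results
end

-- One tree for pre-defined entries, one for words found in the EditBox text.
local static, dynamic = newNode(), newNode()
insert(static, "function")
insert(static, "for")
insert(dynamic, "forbidden")
```

Calling suggest(static, "f") walks down to the "f" node and gathers the words stored beneath it. Keeping the two trees separate means the dynamic one can be discarded and rebuilt whenever the text changes, without touching the static entries.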

My question is how best to strip a lengthy piece of text of every character that is not part of a word. My current approach uses a table of "forbidden" characters, each occurrence of which is replaced with a space. After that, runs of multiple spaces are collapsed into single spaces before the string is split into table entries. The split returns a lot of repeated substrings, because the same piece of syntax is rarely used just once. Can this be done more efficiently?

This is what the code looks like:
Lua Code:
-- Byte table with forbidden characters
local splitByte = {
    [1]   = true, -- SOH control character (byte 1)
    [10]  = true, -- newline
    [32]  = true, -- space
    [34]  = true, -- "
    [35]  = true, -- #
    [37]  = true, -- %
    [39]  = true, -- '
    [40]  = true, -- (
    [41]  = true, -- )
    [42]  = true, -- *
    [43]  = true, -- +
    [44]  = true, -- ,
    [45]  = true, -- -
    [46]  = true, -- .
    [47]  = true, -- /
    [48]  = true, -- 0
    [49]  = true, -- 1
    [50]  = true, -- 2
    [51]  = true, -- 3
    [52]  = true, -- 4
    [53]  = true, -- 5
    [54]  = true, -- 6
    [55]  = true, -- 7
    [56]  = true, -- 8
    [57]  = true, -- 9
    [58]  = true, -- :
    [59]  = true, -- ;
    [60]  = true, -- <
    [61]  = true, -- =
    [62]  = true, -- >
    [91]  = true, -- [
    [93]  = true, -- ]
    [94]  = true, -- ^
    [123] = true, -- {
    [124] = true, -- |
    [125] = true, -- }
    [126] = true, -- ~
}

local n = CodeMonkeyNotepad -- the editbox
local text = n:GetText() -- get the full text string
local space = strchar(32)

-- Replace each forbidden character with a space
for k, v in pairsByKeys(splitByte) do
    -- digits must not be escaped: "%" followed by a digit is not a literal in a pattern
    if k < 48 or k > 57 then
        text = text:gsub("%"..strchar(k), space)
    else
        text = text:gsub(strchar(k), space)
    end
end

-- Collapse runs of 10 down to 2 spaces into a single space
for i=10, 2, -1 do
    text = text:gsub(strrep(space, i), space)
end

-- Collect words in table (used as a set, so repeats collapse into one key)
local words = {}
for k, v in pairsByKeys({strsplit(space, text)}) do
    -- ignore single letters
    if v:len() > 1 then
        words[v] = true
    end
end
Here's an example output, using the actual code as text inside the EditBox:

Note: pairsByKeys is just a custom table iterator that sorts by key.
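The actual pairsByKeys implementation is not shown here. A minimal sketch of such an iterator, along the lines of the well-known version in Programming in Lua, might look like this (the optional comp argument is an assumption, not something the code above uses):

```lua
-- Sketch of a key-sorted iterator; assumed to resemble the helper used above.
local function pairsByKeys(t, comp)
    -- Gather all keys, then sort them (comp is an optional comparison function).
    local keys = {}
    for k in pairs(t) do
        keys[#keys + 1] = k
    end
    table.sort(keys, comp)

    -- Return an iterator that yields key/value pairs in sorted key order.
    local i = 0
    return function()
        i = i + 1
        local k = keys[i]
        if k ~= nil then
            return k, t[k]
        end
    end
end

-- Usage: visit byte codes in ascending order regardless of insertion order.
local order = {}
for k, v in pairsByKeys({ [46] = true, [10] = true, [32] = true }) do
    order[#order + 1] = k
end
```

Since every key has to be collected and sorted before the first iteration, an iterator like this costs more than plain pairs, which only matters when the visiting order actually does.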

Last edited by MunkDev : 09-17-15 at 09:27 PM.