10-30-15, 12:12 PM   #1
Elstarin
Data Compression with Bitfields

I've been working on a large addon project for a long time. My addon gathers lots of data about the user's performance in combat and tries to convert it into a simple format that quickly tells the user what parts of combat they are doing well in and what they need to improve on. A big part of how it displays data is with graphs. My function that draws the graphs is pretty good: it's efficient, and it can easily handle lots of graphs with thousands, or even tens of thousands, of data points each.

However, it's pretty important for my addon to be able to store lots of combat sessions in the saved variables for later viewing, like for comparisons and stuff. I'm worried about how big the saved variables can get with storing all the graph points for lots of fights.

Each of my graphs has an array containing the X and Y plot points, but the X points can be safely discarded before saving since they are taken at fixed intervals, so I can rebuild them later. That leaves me with lots of arrays of Y points, like this:

Lua Code:
graph.data = {
   [1] = 87.93,
   [2] = 38.2,
   [3] = 54.76,
   [4] = 76.3,
   [5] = 65,
   [6] = 34.7,
   ...
}

These can be quite long, easily over 1,000 entries and sometimes over 10,000 per graph, and there are a lot of graphs per fight and a lot of fights stored. So, to finally get to the point, I'm wondering if I can compress this data down with bitfields before saving it, and then decompress it (and rebuild the X points) when the graph is loaded.

Unfortunately, I'm finding bitfields to be incredibly confusing. I'd share my code for them, but I actually haven't managed to get anything even remotely close to working yet, so there's nothing to show.

Is this practical? Is it even possible? I've been unable to find any examples of this type of thing online, which makes me think that either it's really obvious and I'm being dumb, or it just isn't a good idea, and I'm being dumb. If this can work, can anyone point me in the right direction of how to get going with this?

Thanks!
10-30-15, 12:40 PM   #2
Dridzt
Have you found LibCompress, or run any tests on your data with it?

Discussion thread
10-31-15, 12:07 PM   #3
Elstarin
Hmm, somehow I managed to miss that. I've done some testing, and it does seem to compress it down quite a bit. Here's what I was doing for testing:

Lua Code:
local libC = LibStub("LibCompress") -- LibCompress handle

local data = {}

for i = 1, 10000 do
  local y = random(1, 10000) / 100 -- Fake having decimals, just for testing
  data[i] = floor(y * 100)
end

local compressed = libC:Compress(table.concat(data, "~"))
local decompressed = libC:Decompress(compressed)

wipe(data)

local num = 0
for y in decompressed:gmatch("(%d+)~?") do
  num = num + 1
  data[num] = y / 100
end

print(#compressed, #decompressed)

This seems to compress it pretty well. Generally it compresses the string built from 10k table entries from a length of around 49k down to about 21k, which is roughly 42% of the original. If you're wondering what all the dividing and multiplying by 100 is for, I'm just removing the decimal point, because it wastes space.

I haven't done any profiling tests yet, but I'm guessing this method is pretty inefficient. Of course, in practice these are two different functions, one for compressing and saving the data and one for decompressing and loading it, so they don't happen at the same time like they do in my test. However, I would certainly like to cut out the string round-trip if possible. I know very little about bitfields, but as far as I can tell, it should be possible to skip this kind of string conversion.
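
For example, for my real Y tables (the ones with two decimal places), something roughly like this is what I'm picturing. It's a completely untested sketch that assumes every Y value fits into 16 bits once it's multiplied by 100, and it still ends up as a string for the saved variables, just a much denser one:

Lua Code:
-- Untested sketch: pack each Y value into two bytes instead of decimal text with separators
local bytes = {}
for i = 1, #data do
  local v = floor(data[i] * 100 + 0.5) -- Drop the decimal point, same trick as above
  bytes[#bytes + 1] = strchar(bit.rshift(v, 8), bit.band(v, 255)) -- High byte, low byte
end
local packed = table.concat(bytes) -- 2 bytes per point instead of roughly 5

-- Unpacking when the graph is loaded
local unpacked = {}
for i = 1, #packed, 2 do
  local hi, lo = strbyte(packed, i, i + 1)
  unpacked[#unpacked + 1] = (hi * 256 + lo) / 100
end

I'd have to actually test how that compares to LibCompress size-wise, though.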

Thanks for pointing out the library, Dridzt, this gives me much more saved variable space to work with. If anyone has any ways to improve my method with LibCompress, please let me know. Also if anyone knows of a relatively simple way to compress the data more directly, I would absolutely love to hear it, but this is still good.

EDIT: I did a few basic efficiency tests, and on my computer, which is moderately good, it takes about 1.1 seconds per 100k table entries to compress and decompress. The compression portion seems to be about 70% of that time. I'll probably want to try and do the compression steadily in chunks to avoid causing any FPS drop.
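
Something along these lines is what I have in mind for the chunking (again just a rough, untested sketch; the chunk size and the onDone callback are placeholders):

Lua Code:
-- Untested sketch: compress the data in slices, one slice per frame, so the heavy
-- Compress call never runs over the whole table in a single frame. The saved
-- variable would then hold a list of compressed chunks instead of one big string.
-- libC is the LibCompress handle from above.
local CHUNK_SIZE = 1000

local function compressInChunks(data, onDone)
  local chunks = {}
  local co = coroutine.create(function()
    for i = 1, #data, CHUNK_SIZE do
      local last = min(i + CHUNK_SIZE - 1, #data)
      chunks[#chunks + 1] = libC:Compress(table.concat(data, "~", i, last))
      coroutine.yield() -- Pick up the next slice on the next frame
    end
  end)

  local frame = CreateFrame("Frame")
  frame:SetScript("OnUpdate", function(self)
    if coroutine.status(co) == "dead" then
      self:SetScript("OnUpdate", nil)
      onDone(chunks) -- Everything is compressed, hand it off for saving
    else
      coroutine.resume(co)
    end
  end)
end

The loading side could then just decompress and split each chunk the same way it handles the single string now.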

Last edited by Elstarin : 10-31-15 at 12:36 PM.
10-31-15, 12:47 PM   #4
semlar
I would personally just interpret the data and generate a summary of it, as it's unlikely the user needs to keep thousands of separate data points for a graph.
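
As a rough sketch of what I mean (the bucket size here is arbitrary), you could collapse every 10 samples into one averaged point, which turns a 10,000-entry table into 1,000 entries:

Lua Code:
local BUCKET = 10 -- How many raw samples make up one stored point

local function summarize(data)
  local summary = {}
  for i = 1, #data, BUCKET do
    local total, count = 0, 0
    for j = i, min(i + BUCKET - 1, #data) do
      total = total + data[j]
      count = count + 1
    end
    summary[#summary + 1] = total / count -- Average of this bucket
  end
  return summary
end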

If you truly need to keep all of that information and there's no way to simplify it, I'm not sure compressing it is going to be worth the trade-off of slowing access time to save space, which isn't particularly limited.
10-31-15, 01:43 PM   #5
Elstarin
Obviously, this is a very contrived example, but were you thinking something more or less along these lines?

Lua Code:
local newData = {}
local tolerance = 1 -- Arbitrary threshold, just for this example
for i = 1, #data do
  -- Mathy stuff here to get the angle of this point relative to data[i - 1] and data[i + 1],
  -- then figure out if the current data entry can be safely discarded or not.
  -- If the angle is basically the same, like it would be between the points 20, 30, and 40,
  -- then skip adding the one for 30, it isn't necessary.
  local prev, nxt = data[i - 1], data[i + 1]
  local pointIsNecessary = true
  if prev and nxt then
    -- Since the X points are evenly spaced, comparing the Y changes on each side works as an angle check
    pointIsNecessary = abs((data[i] - prev) - (nxt - data[i])) > tolerance
  end

  if pointIsNecessary then
    newData[#newData + 1] = data[i]
  end
end

data = newData

I had thought about this before, and decided it wouldn't work in my system, but since then I have changed how the graph data gets read and I probably can do this with minimal rewriting. I just never came back to this idea...

Oh, and to be clear, I know nobody can help me actually write the calculations to determine if the point is necessary or not, because I haven't shown any of my graph's code. I'm not worried about that part, I'm sure I can figure it all out, and I don't like asking people to spend their time on my projects. I just want to make sure this is actually the type of thing you mean, Semlar.

Anyway, I guess I was completely on the wrong track when thinking about using a compression system. It seems kind of obvious now that Semlar pointed it out... Oh well. Thanks for the help!

Last edited by Elstarin : 10-31-15 at 01:51 PM.
10-31-15, 02:03 PM   #6
semlar
If you don't need to show frequent spikes in your graph you can use a smoothing algorithm to reduce the noise and show something closer to a moving average.

What kind of filtering you use depends on the type of data and how close your graph should be to the original values.

You can store the initial points the same way you're doing it now, then apply your filter to the whole set of data once it's complete to reduce what you need to keep, rather than trying to decide whether to add a point based on the previous entry; it really needs to be processed as a whole.
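
As a rough example of that kind of filter (the window size here is arbitrary), a plain centered moving average over the finished Y array would look something like this:

Lua Code:
local WINDOW = 2 -- Points on each side, so 5 samples go into each average

local function movingAverage(data)
  local smoothed = {}
  for i = 1, #data do
    local total, count = 0, 0
    for j = max(1, i - WINDOW), min(#data, i + WINDOW) do
      total = total + data[j]
      count = count + 1
    end
    smoothed[i] = total / count
  end
  return smoothed
end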

You might take a look at something like the Ramer–Douglas–Peucker algorithm if you want to keep the same general shape of the graph while reducing the number of points required.

Last edited by semlar : 10-31-15 at 02:12 PM.
10-31-15, 04:13 PM   #7
Elstarin
I do use a type of smoothing when I actually draw the graph so that each visible graph only uses a maximum of around 300 - 500 textures even if it's drawing from a table of thousands of points. However, my method of ignoring points in that is fairly arbitrary, and I'm not happy with it.

I had never heard of that algorithm before, but the wiki article on it looks really interesting. I think I'll be able to apply it to my data table when I'm saving the graph without too much trouble. Making it work with a graph that's currently active may be slightly more complicated, but I'm pretty sure I'll be able to work that out.

I'm not familiar with the language used in the code example for the algorithm (it's Java, right?), but I found a Lua version of it, so I'll just try to mold that into my code.

Thanks for the help! If I can make this work well, it will make my graphs dramatically more efficient and the storage space won't be an issue at all.
11-01-15, 02:36 PM   #8
Elstarin
After playing around with it for a while, here's what I have come up with:

Lua Code:
local badPoints = {}
local newData = {}
local startPoint, stopPoint
local function smoothingAlgorithm(data, first, last, tolerance, isRecursing)
  if not isRecursing then -- Only true the first time it is called, not when it calls itself
    wipe(badPoints)

    startPoint, stopPoint = first, last -- When it matches these values again, it *should* be done running
  end

  local maxD = 0
  local farthestIndex = 0

  local x2, y2 = data[first], data[-first] -- Start of the segment
  local x3, y3 = data[last], data[-last] -- End of the segment
  local bottom = sqrt((x2 - x3) ^ 2 + (y2 - y3) ^ 2) -- Length of the segment (the triangle's base)

  for i = first + 1, last - 1 do
    local x1, y1 = data[i], data[-i]

    local area = abs(0.5 * (x2 * y3 + x3 * y1 + x1 * y2 - x3 * y2 - x1 * y3 - x2 * y1)) -- Area of the triangle made by this point and the two endpoints
    local distance = bottom > 0 and area / bottom or 0 -- The triangle's height, which is this point's distance from the segment

    if distance > maxD then
      maxD = distance
      farthestIndex = i
    end
  end

  if maxD > tolerance then
    -- The farthest point matters for the shape, so keep it and simplify each half around it
    smoothingAlgorithm(data, first, farthestIndex, tolerance, true)
    smoothingAlgorithm(data, farthestIndex, last, tolerance, true)
  else
    -- Everything between the endpoints is within tolerance of the segment, so flag it all for removal
    for i = first + 1, last - 1 do
      badPoints[i] = true
    end
  end

  if first == startPoint and last == stopPoint then -- Should mean it's done running
    wipe(newData)

    for i = startPoint, stopPoint do
      if not badPoints[i] then -- The point was not flagged to be removed
        local num = #newData + 1
        newData[num] = data[i]
        newData[-num] = data[-i]
      end
    end

    for i = startPoint, stopPoint do
      if newData[i] then -- Reconstruct the data table
        data[i] = newData[i]
        data[-i] = newData[-i]
      elseif data[i] then -- I assume this is cheaper than calling wipe(data) first, since I'm already doing the loop anyway
        data[i] = nil
        data[-i] = nil
      end
    end
  end
end

In my graph data tables I store the X values as positive, and the Y as negative.
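
So a graph's data table looks something like this (made-up numbers):

Lua Code:
data = {
  [1] = 0.5, [-1] = 87.93, -- X of point 1, Y of point 1
  [2] = 1.0, [-2] = 38.2,
  [3] = 1.5, [-3] = 54.76,
}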

This function is moderately expensive to run, but I did a bunch of profiling tests and it wasn't terrible. It shouldn't cause any FPS loss, as long as I'm careful. I feel like there must be a much better way to handle the tables at the end, when it's done running. I hate looping through it twice, but at the moment I can't see a better way to do it while keeping my data table indexed, without making a throwaway table. A lot of my addon is currently built around the data table being indexed, and it would be a big pain to change that.

I haven't done a ton of practical testing of it yet, but from what I have seen, it seems to be doing a good job. My current problem is that I cannot find any way to call it gradually while a graph is active (new data points are being added), but for now I can keep using my old smoothing method on active graphs and use this when I save them. It isn't ideal, but I doubt any users would notice the difference anyway...

I'm always happy to have the flaws in my code pointed out, so if anyone sees a problem with it or a way it could be improved, I'd love to hear it.

The source code that this is more or less based on can be found here: https://quangnle.wordpress.com/2012/...ts-of-a-curve/


