website scraping help please? - WoWInterface
10-25-12, 09:41 AM   #1
myrroddin
A Pyroguard Emberseer
AddOn Author
Join Date: Oct 2008
Posts: 1,086
website scraping help please?

I was talking to the staff at wowdb.com, and they gave permission for me to scrape their mailbox data. wowhead.com does not track mailboxes, if people wondered.

My problem is that I have no idea how to scrape the data because I have never done any scraping. I looked at DataTools on Curse, and I got headaches.

If someone wants to teach me, I would like to learn how DT makes an addon, populates it with data from wowhead or wowdb or the game's files, etc. On the other hand, if someone wants to present me with "here you go", I'll take that happily as well. The format I want the data in is something like the following example, as I will need to make corrections such as moving mailboxes that are in the wrong location, putting icons on the correct floors, etc.

The LocationMapper line for the mailbox link above starts on line 7067, and I understand it doesn't follow the table format I need. If I could even break it down into something readable and then copy/paste, that's fine too.

Lua Code:
    -- HandyNotes_PostService.lua
    local HPS = LibStub("AceAddon-3.0"):NewAddon("HandyNotes_PostService")

    do
        function HPS:ParseData()
            local mailboxes = HPS:Data()
            -- blah blah
        end
    end

    -- AceAddon calls OnInitialize (note the spelling) once the addon has loaded
    function HPS:OnInitialize()
        -- blah
    end

    -- PostService_Mailboxes.lua
    local HPS = LibStub("AceAddon-3.0"):GetAddon("HandyNotes_PostService")

    function HPS:Data()
        local mailboxes = {
            ["StormwindCity"] = { -- Astrolabe's mapFile name, but it could be Blizzard's mapID and I can convert
                [0] = "1|62207440|" -- [mapFloor] = "factionNum|coord|factionNum|coord|" where Alliance = 1, Horde = 2, Other = 3
            }
        }
        return mailboxes
    end
10-25-12, 09:42 AM   #2
myrroddin
A Pyroguard Emberseer
AddOn Author
Join Date: Oct 2008
Posts: 1,086
If the game client itself keeps track of the locations, that would be awesome, but I don't think it does.
10-25-12, 10:57 AM   #3
Barjack
A Black Drake
AddOn Author
Join Date: Apr 2009
Posts: 89
I wouldn't really call this a scraping problem, but I guess you could call this sort of data conversion part of a scraping problem. It seems like all the data you need exists in the table passed to that "Mapper" call; all you need to do is convert it to a format you can use in Lua.

If you want to do this in an automated way, you'll probably want to use some sort of scripting language that can load JSON, load that table in there, parse it that way, and output some Lua table you can include in your addon. Perhaps a local install of Lua can do things like this, but I really don't know. I imagine languages like Python, Perl or Ruby would be how most people go about things like that.
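As a minimal sketch of that approach in Python (one of the languages mentioned above), the following loads a JSON string and writes it back out as Lua table source. The `to_lua` and `quote` helpers and the sample input are hypothetical; the real Mapper table's layout is not shown in this thread, so treat the shape of `SAMPLE_JSON` as an assumption.

```python
import json


def quote(text):
    """Quote a string for Lua. Assumes plain text without control characters,
    since JSON and Lua escape sequences differ for those."""
    return json.dumps(str(text), ensure_ascii=False)


def to_lua(value, indent=0):
    """Serialize JSON-compatible data (dict, list, str, number, bool, None) as Lua source."""
    pad = "    " * indent
    inner = "    " * (indent + 1)
    if isinstance(value, dict):
        if not value:
            return "{}"
        # JSON object keys are always strings, so emit them as quoted Lua keys.
        lines = [f"{inner}[{quote(key)}] = {to_lua(item, indent + 1)}," for key, item in value.items()]
        return "{\n" + "\n".join(lines) + "\n" + pad + "}"
    if isinstance(value, list):
        if not value:
            return "{}"
        lines = [f"{inner}{to_lua(item, indent + 1)}," for item in value]
        return "{\n" + "\n".join(lines) + "\n" + pad + "}"
    if isinstance(value, bool):  # must be checked before numbers: bool is an int subclass
        return "true" if value else "false"
    if value is None:
        return "nil"
    if isinstance(value, str):
        return quote(value)
    return repr(value)  # int or float


# Hypothetical input: one zone, one floor, one packed pin.
SAMPLE_JSON = '{"ExampleZone": {"0": [83265]}}'

if __name__ == "__main__":
    print("local mailboxes = " + to_lua(json.loads(SAMPLE_JSON)))
```

The output can be pasted into an addon file as-is; converting keys such as floor numbers from strings to Lua numbers would be a further step.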

I looked at http://static-azeroth.cursecdn.com/1...574/js/core.js to understand what the "pins" array is. Running the Mapper function through a pretty printer shows this:
Code:
	function f(G) {
		var F = G >> 9;
		var H = G & 511;
		return [F / 5, H / 5]
	}
This is what turns a "pin" into a coordinate pair. For example the "pin" 83265 in Ironforge results in x = (83265 >> 9) / 5 = 32.4, and y = (83265 & 511) / 5 = 64.2. That results in a pin at 32.4,64.2 which is correct. You could probably do that part in Lua or in your pre-processing, whichever works best for you.
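The same decoding can be sketched in Python for the pre-processing stage. `decode_pin` mirrors the JavaScript function above; `pack_coord` is a hypothetical helper that assumes the eight-digit strings in the original post (such as "62207440") are two four-digit values with two implied decimals each.

```python
def decode_pin(pin):
    """Split a packed pin into (x, y) map percentages, as core.js does."""
    x = (pin >> 9) / 5   # bits above the low 9 hold x in fifths of a percent
    y = (pin & 511) / 5  # the low 9 bits hold y in fifths of a percent
    return x, y


def pack_coord(x, y):
    """Format a pair as an 8-digit string, e.g. 62.20, 74.40 -> "62207440".

    Assumption: this matches the addon's coord format. A value of exactly
    100.0 would need five digits and is not handled here.
    """
    return f"{round(x * 100):04d}{round(y * 100):04d}"


if __name__ == "__main__":
    x, y = decode_pin(83265)     # the Ironforge pin from the example above
    print(x, y)                  # 32.4 64.2
    print(pack_coord(x, y))      # 32406420
```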

Also it may be worth noting that if this data isn't something you'll need to convert often, you could probably do some amount of work just with some regular expression find-and-replace on that huge table, instead of loading it as JSON and running the tree, etc. But this might make converting pins more difficult if that is needed in the pre-Lua stage.

As for stuff like how to convert its "floors" and zone names to something easier for you to use, I'm not sure what your options are. There may be Lua libraries or something to help you there, but there may not be either.
10-25-12, 11:35 AM   #4
SDPhantom
A Pyroguard Emberseer
AddOn Author
Join Date: Jul 2006
Posts: 1,876
Originally Posted by myrroddin View Post
If the game client itself keeps track of the locations, that would be awesome, but I don't think it does.
There is a tracking option for mailboxes in the default UI; it adds icons for them on the minimap when you're close to one.
__________________
ESOUI AddOns | WoWInterface AddOns
"All I want is a pretty girl, a decent meal, and the right to shoot lightning at fools."
-Anders (Dragon Age: Origins - Awakening)
10-25-12, 06:23 PM   #5
Phanx
Cat.
AddOn Author
Join Date: Mar 2006
Posts: 5,617
That's completely useless for the OP's purpose, though, as those (1) only appear when you are in minimap range of the mailbox, and (2) are not accessible by addons in any way.
__________________
Author/maintainer of Grid, PhanxChat, oUF_Phanx, and many more.
Troubleshoot an addon | Turn any code into an addon | More addon resources
Need help with your code? Post all of your actual code! Attach or paste your files.
Please don’t PM me about addon bugs or code questions. Post a comment or forum thread instead!
10-26-12, 07:04 AM   #6
myrroddin
A Pyroguard Emberseer
AddOn Author
Join Date: Oct 2008
Posts: 1,086
Worst case I can perform data entry by mousing over icons per zone and floor. It will take some time, but it will work. I was just hoping to learn if someone was willing and able to teach.
10-26-12, 05:33 PM   #7
Phanx
Cat.
AddOn Author
Join Date: Mar 2006
Posts: 5,617
Originally Posted by Barjack View Post
It seems like all the data you need exists in the table passed to that "Mapper" call, all you need to do is convert it to a format you can use in Lua. ... Perhaps a local install of Lua can do things like this ...
Definitely. If someone can post the table, I can convert it for you or give you a Lua script to convert it. However, I've never done anything remotely related to JSON or website scraping, so I have no idea how to obtain said table.
10-27-12, 12:52 AM   #8
Saiket
A Chromatic Dragonspawn
AddOn Author
Join Date: Jul 2008
Posts: 154
I've done a lot of similar parsing for _NPCScan.Overlay, so I rearranged my Python 3 scripts to pull your mailbox locations into a Lua source file. I left out my MPQ parsing code since it requires you to build a DLL, so this version reads DBC files directly. The attached zip contains:
  • WorldMapArea.dbc - Already extracted for you; you may need to get the latest version from your data files if new maps with mailboxes get added in a patch.
  • mailboxes.py - The actual scraping script.
  • dbc.py - A simple module for reading DBC files.
  • mailboxes.bat - Windows batch file to run the above script with default parameters.
  • mailboxes.lua - The sample output file my run created.

Note that you'll need Python 3.2+ to run the scripts, and you must install BeautifulSoup4 to interpret WoWDB's HTML.

Here's how it works in summary:
  1. Download the raw HTML for object 142075 (mailbox) from WoWDB.
  2. Interpret the text as HTML using a forgiving XML parser.
  3. Search the resulting document tree for a div with ID "mapper-container" with BeautifulSoup.
  4. The script tag following that div contains JavaScript defining map points. Strip off the "Mapper" constructor call with a regex, and parse the contained argument table as JSON.
  5. The table's contents are pretty straightforward, but maps are represented by their AreaTable IDs (no WoW API exposes these) instead of their MapArea IDs (what you get from GetCurrentMapAreaID). This is where WorldMapArea.dbc comes in, to convert them to IDs you can use within WoW.
  6. Write it.
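Steps 4 and 5 above can be sketched with only the Python standard library. This is not the attached script: the function names, the sample `<script>` text, and the ID values are all hypothetical placeholders, and in the real pipeline the script text would come from BeautifulSoup and the ID lookup from WorldMapArea.dbc.

```python
import json
import re

# Hypothetical stand-in for the <script> text following the "mapper-container" div.
# The real page's layout and key names may differ.
SCRIPT_TEXT = 'new Mapper({"1001": {"0": [83265]}});'

# Hypothetical AreaTable ID -> MapArea ID lookup; made-up values for illustration.
AREA_TO_MAP = {1001: 2001}


def extract_mapper_table(script_text):
    """Strip the Mapper(...) call and parse its argument as JSON.

    Assumes the call is the last parenthesized expression in the text,
    because the greedy match runs to the final ")".
    """
    match = re.search(r"Mapper\((.*)\)", script_text, re.DOTALL)
    if match is None:
        raise ValueError("no Mapper(...) call found")
    return json.loads(match.group(1))


def remap_areas(table, area_to_map):
    """Re-key the table by MapArea ID, dropping areas with no known map."""
    remapped = {}
    for area_id, floors in table.items():
        map_id = area_to_map.get(int(area_id))
        if map_id is None:
            continue  # no matching world map for this area
        remapped[map_id] = {int(floor): pins for floor, pins in floors.items()}
    return remapped


if __name__ == "__main__":
    table = extract_mapper_table(SCRIPT_TEXT)
    print(remap_areas(table, AREA_TO_MAP))
```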

Feel free to ask any questions about the script if it interests you. If not though, I think the included Lua source should be good enough to use.
Attached Files
File Type: zip mailboxes.zip (14.2 KB, 352 views)

Last edited by Saiket : 10-27-12 at 02:52 PM. Reason: Added attachment, and actually *uploaded* it this time.
10-27-12, 09:58 AM   #9
myrroddin
A Pyroguard Emberseer
AddOn Author
Join Date: Oct 2008
Posts: 1,086
Thank you. I'm going to take a poke at this. Mailboxes are a first step; I want to eventually parse out NPCs who repair, train classes, train skills, and vend certain things. But I need to learn how to parse the data first.
10-27-12, 10:17 AM   #10
myrroddin
A Pyroguard Emberseer
AddOn Author
Join Date: Oct 2008
Posts: 1,086
Wait... "Attached zip".... did I miss something, because I don't see one on your post.
10-27-12, 11:37 AM   #11
Seerah
Fishing Trainer
WoWInterface Super Mod
Featured
Join Date: Oct 2006
Posts: 10,691
I think he just forgot it.
__________________
"You'd be surprised how many people violate this simple principle every day of their lives and try to fit square pegs into round holes, ignoring the clear reality that Things Are As They Are." -Benjamin Hoff, The Tao of Pooh

10-27-12, 02:53 PM   #12
Saiket
A Chromatic Dragonspawn
AddOn Author
Join Date: Jul 2008
Posts: 154
Oops, I had added it to the attachment manager window, but forgot to hit the "upload" button. I've added it to my original post in an edit.
10-27-12, 09:07 PM   #13
myrroddin
A Pyroguard Emberseer
AddOn Author
Join Date: Oct 2008
Posts: 1,086
That looks.... AWESOME!! You are correct that WoWDB does not break mailboxes into factions, which means I will have to copy and edit the file, keeping one copy as the "original backup for updating". No big deal there.

I have Python 3.3.x 64 bit installed, but even after reading the page's instructions, I could not figure out how to install BeautifulSoup. Also, when I want to parse an update, do I run the batch file, or mailboxes.py? I am guessing the batch file, as its code looks like it creates the latter file.

As for floors, I noticed that Dalaran lists 1 and 2, Orgrimmar is 0 and 1, and the Shrine of Two Moons is 1 and 2 but Shrine of Seven Stars is 3 and 4. Is that a parse issue, or does the game return those values for those zones' floors? Just curious if I need to edit those, or if they are correct, yet odd.

Two last questions for now: how do/should I look at WorldMapArea.dbc, and if the game client does not save mailbox locations in its cache, what is this file used for?
10-27-12, 11:14 PM   #14
Vlad
A Molten Giant
AddOn Author
Join Date: Dec 2005
Posts: 792
Originally Posted by myrroddin View Post
As for floors, I noticed that Dalaran lists 1 and 2, Orgrimmar is 0 and 1, and the Shrine of Two Moons is 1 and 2 but Shrine of Seven Stars is 3 and 4. Is that a parse issue, or does the game return those values for those zones' floors? Just curious if I need to edit those, or if they are correct, yet odd.
That's because the zone has 4 floors: 2 for Horde and 2 for Alliance; floor 0 is the regular map. They just used the same areaID for the zone and the floors of the capitals rather than adding more areaIDs just for the city floors.
__________________
Profile: Curse | Wowhead
10-28-12, 02:23 AM   #15
myrroddin
A Pyroguard Emberseer
AddOn Author
Join Date: Oct 2008
Posts: 1,086
That would make sense if the Shrines did not have different mapIDs, but they do. I guess if I want accurate data, I would skip mapID[811] Vale of Eternal Blossoms and stick to mapID[903] Shrine of Two Moons and mapID[905] Shrine of Seven Stars. When plugging data into Astrolabe and HandyNotes, [903][1] and [903][2] are correct for Moon's floors, while [905][3] and [905][4] are correct for Stars'? Or does Astrolabe use floors [1][2] for Stars'? I will have to test I suppose.

The reason I'm asking is because right now, the HandyNotes plugins for Innkeepers, vendors, trainers, bankers, etc use [811] as their mapID, which is not correct for either city, and it messes up the zone map and each of the city map floors. The icons are all in weird places, and I want to avoid that if possible.

Hey, it occurs to me to wonder, why are there eight coordinate digits rather than six? More accurate, yes, but if I wanted to use user data for missing mailboxes, all the coordinate addons I've seen read as 66.5, 47.2 and not 66.57, 47.21. To raise further questions, the example for GetPlayerMapPosition() uses even longer numbers, and between 0 and 1 at that.
10-28-12, 07:47 AM   #16
Vlad
A Molten Giant
AddOn Author
Join Date: Dec 2005
Posts: 792
Actually I assumed that the horde city areaID was the same as the map, because that is the case for the alliance city, hehe.

Regarding coordinate precision, most addons show one decimal because it's close enough, but you should use two decimals if you want to be precise; it just takes more space to store that extra digit.
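To illustrate the difference, here is a small Python sketch that turns a 0-to-1 map fraction (the range mentioned above for GetPlayerMapPosition) into a percentage string with one or two decimals. The `to_percent` helper and the sample position are hypothetical, and this version rounds rather than truncates.

```python
def to_percent(fraction, decimals):
    """Convert a 0..1 map fraction to a percentage string with the given decimals."""
    return f"{fraction * 100:.{decimals}f}"


if __name__ == "__main__":
    x, y = 0.6652, 0.4721  # hypothetical player position
    print(to_percent(x, 1), to_percent(y, 1))  # 66.5 47.2
    print(to_percent(x, 2), to_percent(y, 2))  # 66.52 47.21
```

Stored without the decimal point, the one-decimal form needs six digits per pair and the two-decimal form needs eight.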
