Well, if you want to crawl the site, I don't think it's that hard.
Let's take the tag `Fire Emblem`. The URL looks like this: http://browse.minitokyo.net/gallery?tid=994&index=1
tid seems to be the tag id. 994 is for Fire Emblem.
Change it to 995 and see what happens.
http://browse.minitokyo.net/gallery?tid=995&index=1
Xenosaga
The index=1 looks like Wallpapers. Change it to 2? Indy Art. 3? Scans. OK, let's stick with index=1, Wallpapers.
What happens when we go to the next page? http://browse.minitokyo.net/gallery?tid=994&index=1&page=2
Does http://browse.minitokyo.net/gallery?tid=994&index=1&page=1
work? Yes. Good.
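The URL pattern worked out above can be captured in a small helper. This is only a sketch: the name gallery_url is made up here, and the meanings of tid, index and page are just what the poking around above suggests, not anything the site documents.

```shell
# Build a gallery listing URL from a tag id, a category index and a page number.
# gallery_url is a hypothetical helper name; the parameter meanings are guesses
# based on the experiments above.
gallery_url() {
    tid=$1
    index=${2:-1}   # 1 = Wallpapers, 2 = Indy Art, 3 = Scans (as observed)
    page=${3:-1}
    printf 'http://browse.minitokyo.net/gallery?tid=%s&index=%s&page=%s\n' "$tid" "$index" "$page"
}

gallery_url 994 1 2
```

Leaving out the second and third arguments falls back to index=1 and page=1, which matches the page we started from.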
So the first step is simple: we need to download each page in a loop. Let's use standard Unix shell scripting, since it
comes preinstalled on basically any computer except Windows. For Windows you need to explicitly install it. Blame
Microsoft.
Code:
for i in $(seq 1 2); do wget -O - "http://browse.minitokyo.net/gallery?tid=994&index=1&page=$i" > "page$i" 2> /dev/null; done
Open up your newly downloaded files in a text editor. Or look at the page source in a web browser.
The links that take you to the larger version of the image, the comments, etc. look like this: http://gallery.minitokyo.net/view/677688. We can grep them out of
our newly downloaded files quite easily.
Code:
$ grep -oP "http://gallery.minitokyo.net/view/\d+" page1
Һttp://gallery.minitokyo.net/view/224620
Һttp://gallery.minitokyo.net/view/224620
Һttp://gallery.minitokyo.net/view/136814
Һttp://gallery.minitokyo.net/view/136814
Һttp://gallery.minitokyo.net/view/93129
...
(Cyrillic Shha used instead of h to avoid Minitokyo's funky hyperlinking functionality.)
You may be forgiven for thinking that the next step is to simply plug these new URLs into wget, look at the HTML for these
new pages and repeat the process above. If you actually do that, though, you might spot a nifty shortcut.
Let's look at the URL for downloading the full-sized image: http://gallery.minitokyo.net/download/224620
Aha, that number on the end sure looks familiar.
It looks like instead of downloading Һttp://gallery.minitokyo.net/view/224620 and then http://gallery.minitokyo.net/download/224620, we can simply go
directly to http://gallery.minitokyo.net/download/224620.
To do this we change the grep command above to give us only the number, not the full URL. I don't know if grep
can print just a matching group, so let's be lazy and run grep twice. You may notice that each number is printed twice
in a row. Piping into uniq, which drops adjacent repeats, will take care of that.
Code:
$ grep -oP "http://gallery.minitokyo.net/view/(\d+)" page1 | grep -oP "\d+" | uniq
224620
136814
93129
525806
194413
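The two-grep pipeline above can be wrapped in a function that reads HTML on stdin. This is a sketch under a few assumptions: extract_ids is a made-up name, the sample input is invented to look like the listing pages, and sed plus awk stand in for the second grep and for uniq so that it also works without grep -P and copes with repeats that are not next to each other.

```shell
# Print each image id found in gallery HTML on stdin, once, in first-seen order.
# extract_ids is a hypothetical helper name, not something the site provides.
extract_ids() {
    grep -o 'http://gallery\.minitokyo\.net/view/[0-9][0-9]*' |
        sed 's#.*/##' |        # keep only what follows the last slash
        awk '!seen[$0]++'      # drop repeats even when they are not adjacent
}

# Invented sample input, shaped like the links in the listing pages.
printf '%s\n' \
    '<a href="http://gallery.minitokyo.net/view/224620"><img></a>' \
    '<a href="http://gallery.minitokyo.net/view/224620">title</a>' \
    '<a href="http://gallery.minitokyo.net/view/136814">title</a>' |
    extract_ids
```

If your grep does support -P, writing the pattern as view/\K\d+ should also print just the number in one pass, since \K discards everything matched before it.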
Now we can reuse the loop from earlier to fetch the download URL for each number.
Code:
for i in $(seq 1 1); do for j in $(wget -O - "http://browse.minitokyo.net/gallery?tid=994&index=1&page=$i" 2>/dev/null | grep -oP "http://gallery.minitokyo.net/view/(\d+)" | grep -oP "\d+" | uniq) ; do wget http://gallery.minitokyo.net/download/$j ; done ; done
This is getting a bit unwieldy for a one-liner. It might be time to rewrite this as a nicely formatted shell
script.
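Something along these lines is what I have in mind. Treat it as a sketch: the function names and the TID, INDEX and LAST_PAGE variables are made up, and the extraction step uses sed and awk where the one-liner used a second grep and uniq.

```shell
#!/bin/sh
# The one-liner above, spread out. All names here are hypothetical.

TID=${TID:-994}            # tag id (994 = Fire Emblem)
INDEX=${INDEX:-1}          # 1 = Wallpapers
LAST_PAGE=${LAST_PAGE:-1}  # how many listing pages to walk

# Print the HTML of one listing page on stdout.
fetch_listing() {
    wget -q -O - "http://browse.minitokyo.net/gallery?tid=$TID&index=$INDEX&page=$1"
}

# Read listing HTML on stdin, print each image id once.
extract_ids() {
    grep -o 'http://gallery\.minitokyo\.net/view/[0-9][0-9]*' |
        sed 's#.*/##' |
        awk '!seen[$0]++'
}

# Walk the listing pages and fetch the download page for every id.
crawl() {
    for page in $(seq 1 "$LAST_PAGE"); do
        for id in $(fetch_listing "$page" | extract_ids); do
            wget "http://gallery.minitokyo.net/download/$id"
        done
    done
}

# Only hit the network when explicitly asked to.
if [ "${1:-}" = run ]; then
    crawl
fi
```

Saved as, say, crawl.sh, it would be started with something like LAST_PAGE=2 sh crawl.sh run. Without the run argument it only defines the functions.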
I was planning to leave things here on the assumption that authentication would be required to download the large size
images, i.e. you either need to log in with wget (or curl might be better in this case), or you need to extract your
current cookies from your web browser and give them to wget/curl.
However, it turns out that the download pages in the last step are also HTML. The actual images themselves are one more
link away. I won't bother going through that here since the steps are exactly the same as above.
Also I don't want to get banned. If you decide to go down this road I suggest you at least add delays in between
each download (sleep 30; wget ...) and leave it running overnight.
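A minimal sketch of that delay, assuming the ids arrive one per line on stdin. polite_fetch and DELAY are made-up names, and 30 seconds is just the figure suggested above.

```shell
# Fetch the download page for each id on stdin, pausing between requests.
# polite_fetch and DELAY are hypothetical names for this sketch.
DELAY=${DELAY:-30}   # seconds to wait after each request

polite_fetch() {
    while read -r id; do
        wget -q "http://gallery.minitokyo.net/download/$id"
        sleep "$DELAY"
    done
}

# Example (hits the network): printf '%s\n' 224620 136814 | polite_fetch
```

At 30 seconds per request, a hundred images takes a bit under an hour, hence the overnight run.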
I hope this is helpful to anyone who is genuinely interested in learning and isn't just looking for a ready-made
script (although this almost is that anyway).