wget'ing complete web pages/sites

Hi,

I've been downloading driver sets for various laptops, etc. from manufacturer web sites. This is a royal PITA. Why isn't there a "download all" button? (or, an ftp server with directories per product that could be pulled down in one shot!)

Anyway, I have tried a couple of utilities that claim to be able to do this -- with little success. With no first-hand experience *building* web pages/sites, I can only guess as to what the problem is:

Instead of static links on the page, everything hides behind JS (?). And, the tools I have used aren't clever enough to know how to push the buttons?

[or, they don't/can't capture the cookie that must exist to tell the site *what* I want (product, os, language)]

Is there a workaround for this? It will be an ongoing effort for me for several different models of PC and laptop so I'd love to have a shortcut -- ordering restore CD's means a long lag between when I get the machines and when I can finish setting them up. I'd like NOT to be warehousing stuff for a nonprofit! (SWMBO won't take kindly to that!)

Thx,

--don

Reply to
Don Y

If wget doesn't work, you will need to write your own crawler. Some websites are complicated enough that wget can't cope; it isn't very good with dynamic data.

Find someone competent in beautifulsoup or learn it yourself.
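
The guts of such a crawler can be pretty small. A rough, untested sketch (the URL and the file extensions are just placeholders, and it won't help where the links are built by Javascript):

import os
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

START = "http://support.example.com/model1234/drivers"   # placeholder URL
WANTED = (".exe", ".zip", ".pdf")                         # file types to keep

page = requests.get(START, timeout=30)
soup = BeautifulSoup(page.text, "html.parser")

for a in soup.find_all("a", href=True):
    url = urljoin(START, a["href"])            # resolve relative links
    if url.lower().endswith(WANTED):
        name = url.rsplit("/", 1)[-1]
        if not os.path.exists(name):           # don't fetch the same file twice
            r = requests.get(url, timeout=300)
            with open(name, "wb") as f:
                f.write(r.content)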

Reply to
miso

Because if they did that, there would only be one opportunity to sell you something via the ads. By forcing you to come back repeatedly to the same page, there are more opportunities to sell you something. Even if they have nothing to sell, the web designers might want to turn downloading into an ordeal so that the "click" count is dramatically increased.

Sorry, I don't have a solution. Javascript and dynamic web content generated from an SQL database are not going to work. Sites built on a CMS (content management system) are also difficult to bypass. I use WinHTTrack instead of wget for downloading sites. It tries hard, but usually fails on CMS-built sites; see its FAQ under Troubleshooting for clues.

However, even if you find the manufacturer's secret stash of drivers, they usually have cryptic names that defy easy identification. I once did this successfully, and then spent months trying to identify what I had just accumulated.

If the manufacturer has an FTP site, you might try snooping around the public section to see if the driver files are available via ftp.

--
Jeff Liebermann     jeffl@cruzio.com 
150 Felker St #D    http://www.LearnByDestroying.com 
Santa Cruz CA 95060 http://802.11junk.com 
Skype: JeffLiebermann     AE6KS    831-336-2558
Reply to
Jeff Liebermann

FWIW I've gotten a fair bit of mileage out of Dell's hybrid HTTP/FTP site:

formatting link
OT - it surprised me to see Microsoft's FTP site still open after all these decades: ftp://ftp.microsoft.com/

--
Don Kuenz
Reply to
Don Kuenz

The implication is that this is an exercise I would have to repeat for each manufacturer's site? :<

I think one of the tools I have will let me browse-and-save (at least that saves a bunch of keyclicks for each download!)

Reply to
Don Y

A wee bit cynical, eh Jeff? :>

[Actually, I suspect the problem is that such a Big Button would end up causing folks to take the lazy/safe way out -- too often. And, their servers see a bigger load than "necessary".

Besides, most vendors see no cost/value to *your* time! :-/ ]

That's what I figured when I took the time to look at the page's source. :<

That was the first option I tried. It pulled down all the "fluff" that I would have ignored -- and skipped over the meat and potatoes!

In the past, I've just invoked each one and cut-and-pasted some banner that the executable displays as the new file name (putting the old name in parens).

HP's site is particularly annoying. E.g., all the *documents* that you download are named "Download.pdf" (what idiot thought that would be a reasonable solution? Download two or more documents and you have a name conflict! :< )

Looks like I'll go the browse-and-save route I described in my other reply :-/

Yes, but those sites tend to not present the "cover material" that goes with each download. Release notes, versioning info, prerequisites, etc. So, you have no context for the files...

Maybe I'll "take the survey" -- not that it will do any good! ("Buy our driver CD!" "Yes, but will it have the *latest* drivers for the OS I want? And, all the accompanying text? And, do I get a discount if I purchase them for 15 different models??")

Reply to
Don Y

Yes, the software has to be tweaked per site. Somebody good at beautifulsoup can crank it out quickly. The code can be a few dozen lines, but if you don't know the fu, it is a monumental task.

Reply to
miso

I use HTTrack when I need to clone (parts of) sites. It's available for both Windows and Linux.

formatting link

I don't know exactly how smart it is, but I've seen it reproduce some pretty complicated pages. It can scan scripts and it will grab files from any links it can find. It's also pretty well configurable.

George

Reply to
George Neuner

Here is an example of a BS program -- well, actually Python using BS. It is a 14-line program to find every URL in a website.
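
Something along these lines (untested; the URL is a placeholder):

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

base = "http://support.example.com/model1234"    # placeholder URL
html = requests.get(base, timeout=30).text
soup = BeautifulSoup(html, "html.parser")

seen = set()
for a in soup.find_all("a", href=True):
    url = urljoin(base, a["href"])               # make relative links absolute
    if url not in seen:
        seen.add(url)
        print(url)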

When you scrape, you don't want to see the website as it is presented to the human viewing the browser, you just want the goodies. That is why the program is somewhat tweaked to the website, once you figure out how they store the data.

Reply to
miso

The OP has already lost the game. It is obvious that the owner of the website does not want automatic vacuuming of the data.

If there were a downloader that could get past the scripts, the website owners would resort to CAPTCHAs, which are intended to thwart automated tools.

--

Tauno Voipio
Reply to
Tauno Voipio

I think the problem is that the page is "built" on the fly.

I don't think that's the case.

Imagine you were producing some number of PC's that support some number of OS's in some number of languages with some number of updates to some number of subsystems...

Undoubtedly, you would store this information in a configuration management system/DBMS somewhere. It would allow you to "mechanically" indicate the interrelationships between updates, etc.

An update would have a specific applicability, release notes, etc.

Why create hundreds (more like tens of thousands, when you consider each download has its own descriptive page -- or 5) of web pages when you can, instead, create a template that you fill in based on the user's choices? (model, OS, language)

Issue a bunch of specific queries to the DBMS/CMS, sort the results (by driver category) for the given (OS, language, model) and stuff the results into a specific form that you repeat on every page!

I.e., a driver that handles 4 particular languages would magically appear on the pages for those four languages -- and no others. To be replaced by something else as appropriate on the remaining pages.

Sure, you could generate a static page from all this and present *that* to the user. But, why bother? What do *you* gain?
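
In other words, something like this toy sketch (untested; the database, table and column names are invented for the example), run once per page request:

import sqlite3

db = sqlite3.connect("drivers.db")     # hypothetical configuration database
rows = db.execute(
    "SELECT category, title, filename, notes FROM downloads "
    "WHERE model=? AND os=? AND lang=? ORDER BY category",
    ("FooTastic 123", "XP", "en"))     # the user's choices

page = ["<h1>Downloads: FooTastic 123 / XP / English</h1>"]
for category, title, filename, notes in rows:
    page.append('<h2>%s</h2><p><a href="/files/%s">%s</a><br>%s</p>'
                % (category, filename, title, notes))

print("\n".join(page))                 # same template serves every model/OS/language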
Reply to
Don Y

Hello Don,

You could take a look at a program called AutoIt, though I'm not sure it's available on all platforms (I have been using it on Windows).

Although it originated as a simple "record and replay" tool, it has become quite versatile, enabling you to script mouse-clicks dependent on data you read from web pages (as long as the browser has an interface to do so, of course).

Hope that helps, Rudy Wieser

-- Original message: Don Y wrote in news message l6ccq5$1no$ snipped-for-privacy@speranza.aioe.org...

Reply to
R.Wieser

Not necessarily.

Unless there are protected directories, HTTrack can grab everything: HTML, scripts, linked files/resources ... everything. It will even work cross-site [though not by default].

Sometimes you have to do a bit of work figuring out the site structure before you can configure HTTrack to clone it. More often than not, the difficulty with HTTrack is that it grabs *more* than you want.

George

Reply to
George Neuner

What I am *most* interested in is the "cover page" for each download. It's the simplest way to map some bogus file name (x1234h66.exe) to a description of the file ("nVidia graphic driver for FooTastic 123, version 978 11/15/2013"). Too much hassle to type or cut/paste this sort of stuff, otherwise!

(and file headers often aren't consistent about presenting these "annotations" -- so, you end up having to *invoke* each file, later, to figure out what it is supposed to do...)
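
What I'd be after is something that scrapes that cover page and appends to a manifest for me. A rough, untested sketch of the idea (the URL is a placeholder, and the guess that the description lives in an <h1> tag will differ per vendor):

import requests
from bs4 import BeautifulSoup

cover = "http://support.example.com/driver/x1234h66"   # placeholder cover page
soup = BeautifulSoup(requests.get(cover, timeout=30).text, "html.parser")

title = soup.find("h1")                      # assumption: description is the <h1>
desc = title.get_text(strip=True) if title else "unknown"

with open("manifest.txt", "a") as log:       # running map of file -> description
    log.write("x1234h66.exe\t%s\n" % desc)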

Reply to
Don Y

The problem is finding a "list" of pertinent downloads (along with descriptions instead of just file names) for a specific product/os/language.

I.e., "these are ALL the files you will (eventually) need to download if you are building this product to run this os in that language. Then, ideally, fetch them all for you!

For an FTP directory PER PRODUCT/OS/LANGUAGE, this would be easy: just copy the entire directory over! (IE seems to be able to do this easily -- along with many other products -- Firefox seems to insist on "file at a time".)
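
Something as dumb as this would do it -- an untested sketch; the host and path are made up, and it assumes the directory holds plain files with no subdirectories:

import os
from ftplib import FTP

ftp = FTP("ftp.example.com")                 # placeholder host
ftp.login()                                  # anonymous login
ftp.cwd("/pub/drivers/model1234/xp/en")      # placeholder per-product directory

for name in ftp.nlst():
    if not os.path.exists(name):             # skip anything already fetched
        with open(name, "wb") as f:
            ftp.retrbinary("RETR " + name, f.write)
ftp.quit()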

E.g., ages ago, a web page would list individual files and have static links to the files, their descriptions, release notes, etc. So, you could point a tool at such a page and say, "resolve every link on this page and get me whatever is on the other end!"

This doesn't appear to be the case, anymore.

E.g., MS has all (most!) of their "updates" individually accessible, with documentation, WITHOUT going through the update service. Nothing "secret", there.

But, finding them all listed in one place (so you could "get all") is a different story!

I had found a site that had done this -- static links to each update/knowledge base article. All pointing to PUBLIC urls on MS's servers. I figured this would be an excellent asset to use to pull down ALL the updates to, e.g., XP before it goes dark!

Unfortunately, it appears the feds didn't like the site! Nothing that *I* can see wrong with the page I was looking at (all LINKS, and all to MS, not some pirate site). But, I have no idea what else may have been on the site; or, hosted by the same *server*!

I'll now see if I have a local copy of the page and see if I can trick HTTrack into fetching them from a File: URL!
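
Failing that, a few lines of Python ought to manage the same thing: parse the saved page and fetch whatever each link points at. An untested sketch (the saved file name is a placeholder):

import os
import requests
from bs4 import BeautifulSoup

with open("saved_page.html", encoding="utf-8", errors="replace") as f:
    soup = BeautifulSoup(f.read(), "html.parser")

for a in soup.find_all("a", href=True):
    url = a["href"]
    if url.startswith("http"):               # ignore relative and javascript: links
        name = url.rsplit("/", 1)[-1] or "index.html"
        if not os.path.exists(name):         # one copy of each file is plenty
            r = requests.get(url, timeout=300)
            with open(name, "wb") as out:
                out.write(r.content)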

Reply to
Don Y

Possibly "per page" -- if there is any inconsistency from page (product) to page (product)?

Not a viable option, then. Seems like the easiest would be a tool that you can use in "follow me" mode -- *point* to stuff and let it worry about the downloads.

Reply to
Don Y

But, that really isn't much better than just grep(1)-ing the HTML.

And, you need to be smart enough NOT to pull down the same content multiple times! E.g., if there are multiple links to a 200MB file, you only want *one* copy of it.
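
The sort of thing I mean -- an untested sketch, and the normalization is just a guess at what would be needed:

from urllib.parse import urlsplit, urlunsplit

fetched = set()

def already_have(url):
    """Remember which URLs have been pulled so a 200MB file is only fetched once."""
    parts = urlsplit(url)
    # drop the query string and fragment; hosts are case-insensitive
    key = urlunsplit((parts.scheme, parts.netloc.lower(), parts.path, "", ""))
    if key in fetched:
        return True
    fetched.add(key)
    return False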

Apparently, the sites I've been hitting recently synthesize the URLs dynamically. Anything wanting to scrape the page would have to invoke the JS methods for each "button", dropdown, etc.

I think that would be asking a lot from a tool. And, probably result in *getting* more than you really want from the site! (do I *really* want every language variant of every OS supported on a particular product?? :< )

HTTrack does a reasonably good job with "traditional" pages (though getting the "depth" right is tricky)

An amusing trick -- after the fact -- is to right-click the "object" in the "Downloads" list (i.e., after or during the download) and select "Copy URL". This allows you to examine the URL that was ultimately invoked/transferred.

(Speaking in terms of Firefox, here)

Reply to
Don Y
