wget'ing complete web pages/sites

Hi,

I've been downloading driver sets for various laptops, etc. from manufacturer web sites. This is a royal PITA. Why isn't there a "download all" button? (or, an ftp server with directories per product that could be pulled down in one shot!)

Anyway, I have tried a couple of utilities that claim to be able to do this -- with little success. With no first-hand experience *building* web pages/sites, I can only guess as to what the problem is:

Instead of static links on the page, everything hides behind JS (?). And, the tools I have used aren't clever enough to know how to push the buttons?

[or, they don't/can't capture the cookie that must exist to tell the site *what* I want (product, os, language)]

Is there a workaround for this? It will be an ongoing effort for me for several different models of PC and laptop so I'd love to have a shortcut -- ordering restore CD's means a long lag between when I get the machines and when I can finish setting them up. I'd like NOT to be warehousing stuff for a nonprofit! (SWMBO won't take kindly to that!)

Thx,

--don

Reply to
Don Y

So, have you tried "wget" on Linux? It will recurse through all the links and grab all the HTML and graphics. It might not get PDFs by default, but there may be an option to do so. If not, you'd have to write your own filter to parse out all the PDF links and download them, with "lynx", for example.
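
Something along these lines is roughly where I'd start (the URL and file types are just placeholders, adjust to taste):

    # Recurse under the starting directory only (-np), and keep just the
    # file types of interest (-A). "-e robots=off" may be needed if the
    # site's robots.txt blocks mirroring.
    wget -r -l 5 -np -e robots=off \
         -A 'html,pdf,exe,zip' \
         --content-disposition \
         http://support.example.com/drivers/

    # Or, just to see which links the page actually exposes to a non-JS client:
    lynx -dump -listonly http://support.example.com/drivers/ > links.txt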

I did that a long time ago, but not recently.

Reply to
edward.ming.lee

Because if they did that, there would only be one opportunity to sell you something via the ads. Forcing you to come back repeatedly to the same page creates more opportunities to sell you something. Even if they have nothing to sell, the web designers might want to turn downloading into an ordeal so that the "click" count is dramatically increased.

Sorry, I don't have a solution. Javascript and dynamic web content derived from an SQL database are not going to work. CMSes (content management systems) are also difficult to bypass. I use WinHTTrack instead of wget for downloading sites. It tries hard, but usually fails on CMS-built sites; see the FAQ under Troubleshooting for clues.

However, even if you find the manufacturer's secret stash of drivers, they usually have cryptic names that defy easy identification. I once did this successfully, and then spent months trying to identify what I had just accumulated.

If the manufacturer has an FTP site, you might try snooping around the public section to see if the driver files are available via ftp.
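
If it does, wget handles plain FTP directories much better than it handles scripted web pages. Something like this (the host and path here are made up):

    # Mirror a public FTP directory tree; -np keeps it from wandering
    # up out of the starting directory.
    wget -r -np ftp://ftp.example.com/pub/drivers/some_model/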

--
Jeff Liebermann     jeffl@cruzio.com 
150 Felker St #D    http://www.LearnByDestroying.com 
Santa Cruz CA 95060 http://802.11junk.com 
Skype: JeffLiebermann     AE6KS    831-336-2558
Reply to
Jeff Liebermann

Because then it would be more difficult to sell your attention to the websites that pay for you to look at the content. And they couldn't sell you multiple times to different advertisers.

Because they don't want leech sites downloading and re-publishing their content and getting the ad clicks.

Because there is a /whole ecosystem/ devoted to knowing who has clicked what. Install "Ghostery" and be amazed at up to 20 trackers per page.

It is an arms race. See "Contact" by Carl Sagan for a prediction although, since that was written pre-WWW, it refers to TV advertising.

Reply to
Tom Gardner

FWIW I've gotten a fair bit of mileage out of Dell's hybrid HTTP/FTP site.

OT - it surprised me to see Microsoft's FTP site still open after all these decades: ftp://ftp.microsoft.com/

--
Don Kuenz
Reply to
Don Kuenz

A wee bit cynical, eh Jeff? :>

[Actually, I suspect the problem is that such a Big Button would end up causing folks to take the lazy/safe way out -- too often. And, their servers see a bigger load than "necessary".

Besides, most vendors see no cost/value to *your* time! :-/ ]

That's what I figured when I took the time to look at the page's source. :<

That was the first option I tried. It pulled down all the "fluff" that I would have ignored -- and skipped over the meat and potatoes!

In the past, I've just invoked each one and cut-and-pasted whatever banner the executable displays as the new file name (putting the old name in parens).

HP's site is particularly annoying. E.g., all the *documents* that you download are named "Download.pdf" (what idiot thought that would be a reasonable solution? Download two or more documents and you have a name conflict! :< )

Looks like I'll go the browse-and-save route I described in my other reply :-/

Yes, but those sites tend to not present the "cover material" that goes with each download. Release notes, versioning info, prerequisites, etc. So, you have no context for the files...

Maybe I'll "take the survey" -- not that it will do any good! ("Buy our driver CD!" "Yes, but will it have the *latest* drivers for the OS I want? And, all the accompanying text? And, do I get a discount if I purchase them for 15 different models??")

Reply to
Don Y

Check the subject line of the post? :>

No, the problem is the ABSENCE of links on the page! Everything is done in script, URLs are assembled on the fly and "resolved" when you click on a button, etc.

wget et al. never *see* any URLs in the web pages (other than the few static ones like "privacy policy", "terms of use", etc.)
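
Easy enough to check: grep the raw HTML for hrefs (the URL below is a placeholder) and about all that comes back is that boilerplate.

    # What a non-JS client actually sees on the page:
    curl -s http://support.example.com/product/12345 | grep -oE 'href="[^"]*"'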

Reply to
Don Y

Yet another cynic! :>

But the leech sites are exactly the ones who would invest in ways to get around this! So, you inconvenience your genuine customers by doing something that others with financial incentive will gladly circumvent? :<

As I said elsewhere, I think the real reason is they don't want folks downloading "ALL" -- because that's the easiest thing to do. Bandwidth hit.

And, they probably figure they offer restore disks at $10 so why bother with all these downloads? How many different machines do you have, typically?

And, if you are MAINTAINING a system, chances are you only want the most recent updates since your last visit!

Reply to
Don Y

I use HTTrack when I need to clone (parts of) sites. It's available for both Windows and Linux.

I don't know exactly how smart it is, but I've seen it reproduce some pretty complicated pages. It can scan scripts and it will grab files from any links it can find. It's also pretty well configurable.
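
For what it's worth, a basic invocation of the command-line version looks something like this (the URL and output directory are placeholders):

    # Clone a site into ./mirror; by default it stays on the starting
    # host and rewrites links so the copy browses locally.
    httrack "http://support.example.com/" -O ./mirror -v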

George

Reply to
George Neuner

The OP has already lost the game. It is obvious that the owner of the website does not want automatic vacuuming of the data.

If there were a script-parsing downloader, the website owners would resort to CAPTCHAs, which are intended to thwart automated tools.

--

Tauno Voipio
Reply to
Tauno Voipio

I think the problem is that the page is "built" on the fly.

I don't think that's the case.

Imagine you were producing some number of PC's that support some number of OS's in some number of languages with some number of updates to some number of subsystems...

Undoubtedly, you would store this information in a configuration management system/DBMS somewhere. It would allow you to "mechanically" indicate the interrelationships between updates, etc.

An update would have a specific applicability, release notes, etc.

Why create hundreds (more like tens of thousands, when you consider each download has its own descriptive page -- or 5) of web pages when you can, instead, create a template that you fill in based on the user's choices? (model, OS, language)

Issue a bunch of specific queries to the DBMS/CMS, sort the results (by driver category) for the given (OS, language, model) and stuff the results into a specific form that you repeat on every page!

I.e., a driver that handles 4 particular languages would magically appear on the pages for those four languages -- and no others. To be replaced by something else as appropriate on the remaining pages.

Sure, you could generate a static page from all this and present *that* to the user. But, why bother? What do *you* gain?
Reply to
Don Y

Hello Don,

You could take a look at a program called AutoIt, though I'm not sure it's available on all platforms (I have been using it on Windows).

Although it originated as a simple "record and replay" tool, it has become quite versatile, enabling you to script mouse-clicks dependent on data you read from web pages (as long as the browser has an interface to do so, of course).

Hope that helps, Rudy Wieser

-- Original message: Don Y wrote in news message l6ccq5$1no$ snipped-for-privacy@speranza.aioe.org...

Reply to
R.Wieser

Not necessarily.

Unless there are protected directories, HTTrack can grab everything: HTML, scripts, linked files/resources ... everything. It will even work cross-site [though not by default].

Sometimes you have to do a bit of work figuring out the site structure before you can configure HTTrack to clone it. More often than not, the difficulty with HTTrack is that it grabs *more* than you want.
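
The scan rules (filters) are how you rein it in. Something like this, for example, to keep only the driver packages and their description pages (the patterns are purely illustrative):

    # Exclude everything by default, then whitelist what you want;
    # later rules take precedence over earlier ones.
    httrack "http://support.example.com/product/12345" -O ./drivers \
            "-*" "+*support.example.com/product/12345*" \
            "+*.pdf" "+*.exe" "+*.zip"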

George

Reply to
George Neuner

What I am *most* interested in is the "cover page" for each download. It's the simplest way to map some bogus file name (x1234h66.exe) to a description of the file ("nVidia graphic driver for FooTastic 123, version 978 11/15/2013"). Too much hassle to type or cut/paste this sort of stuff, otherwise!

(and file headers often aren't consistent about presenting these "annotations" -- so, you end up having to *invoke* each file, later, to figure out what it is supposed to do...)

Reply to
Don Y

The problem is finding a "list" of pertinent downloads (along with descriptions instead of just file names) for a specific product/os/language.

I.e., "these are ALL the files you will (eventually) need to download if you are building this product to run this os in that language. Then, ideally, fetch them all for you!

For an FTP directory PER PRODUCT/OS/LANGUAGE, this would be easy; just copy the entire directory over! (IE seems to be able to do this easily -- along with many other products -- Firefox seems to insist on "file at a time")

E.g., ages ago, a web page would list individual files and have static links to the files, their descriptions, release notes, etc. So, you could point a tool at such a page and say, "resolve every link on this page and get me whatever is on the other end!"

This doesn't appear to be the case, anymore.

E.g., MS has all (most!) of their "updates" individually accessible, with documentation, WITHOUT going through the update service. Nothing "secret", there.

But, finding them all listed in one place (so you could "get all") is a different story!

I had found a site that had done this -- static links to each update/knowledge base article. All pointing to PUBLIC urls on MS's servers. I figured this would be an excellent asset to use to pull down ALL the updates to, e.g., XP before it goes dark!

Unfortunately, it appears the feds didn't like the site! Nothing that *I* can see wrong with the page I was looking at (all LINKS, and all to MS, not some pirate site). But, I have no idea what else may have been on the site; or, hosted by the same *server*!

I'll now see if I have a local copy of the page and see if I can trick HTTrack into fetching them from a File: URL!
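
If that saved page really is just a pile of static links, wget can chew on the local copy directly, which may be simpler than coaxing HTTrack into it (the filename and output directory are placeholders):

    # Parse a saved HTML page for links and fetch everything it points to.
    # Add --base=URL if the page uses relative links.
    wget --force-html -i updates_page.html --content-disposition -P ./xp-updates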

Reply to
Don Y
