"volume packing"

Hi,

I'm looking for a tool that will let me pass "portions of filesystems" to it (i.e., lists of file hierarchies) along with a "volume size" and have *it* come up with *an* "optimal" packing arrangement to put those files onto the minimum number of volume-sized media. [this is similar to packing software for distribution]

MS had a similar tool in one of their ancient kits that was effective for packing relatively *few* files onto floppies (yes, *that* ancient!).

The tool should be able to take into account block sizes and overheads of the target media (i.e., a 1 byte file takes up a lot more than 1 byte!).
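
For illustration, a rough cost model -- untested, and the block size and per-file overhead are made-up numbers; substitute whatever the target medium actually uses:

    # cost of a file on the target medium: round up to whole blocks,
    # plus a nominal per-file overhead for the directory entry/inode
    # (both constants are assumptions, not real figures)
    def allocated_size(file_bytes, block=2048, per_file_overhead=256):
        blocks = max((file_bytes + block - 1) // block, 1)
        return blocks * block + per_file_overhead

So a 1-byte file "costs" 2,304 bytes under this model, not 1.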

And, the tool should try to preserve related portions of the hierarchy on the same medium, where possible. (i.e., /foo/bar and /foo/baz should cling together more tightly than either would with /bizzle/bag)

My current need is for a windows platform but I can easily port something from "elsewhere" if need be.

Thanks!

--don

Reply to
Don Y

/Optimal/ cannot be done (at least, not in polynomial time, as far as is currently known). You are looking for something roughly equivalent to the "knapsack problem", which is a well-known NP-complete problem. The best you could hope for is a reasonable solution.

I'd be surprised if you find anything useful. Perhaps if you try to describe what you are trying to do, and why it can't be handled by a simpler solution, it may be possible to help. The only vaguely related problem I could imagine is deciding how to split the packages of a large Linux distro between different CD images (or different floppy images for old distros, or different DVD images for the newest and biggest distros).

Reply to
David Brown

Correct. I was trying to acknowledge this by emphasizing "an" (optimal).

Exactly. There are multiple (unknown) ways of packing a set of objects onto a medium. In terms of "quality of solution", they might be equivalent (i.e., two media -- A & B -- have free spaces a1 and b1 in a given packing, while an alternate packing can result in identical free spaces -- but different *contents* on their respective media). Or, they can be "less ideal" (i.e., the total number of media remains the same -- and, thus, the total amount of free space -- but the contents of any given portion of the hierarchy end up spread across more volumes). *An* ideal packing would use exactly ceil(total_size/volume_size) volumes, with free space on only the last volume. *An* "optimal" packing would use the minimum number of volumes necessary to contain all of the objects (ideal is better than optimal). [Of course, all of the above subject to the above constraints.]
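
(In other words, the floor for *any* packing -- a rough sketch, assuming sizes are already padded to allocation units:

    import math

    # *an* "ideal" packing achieves exactly this many volumes
    def min_volumes(sizes, volume_size):
        return math.ceil(sum(sizes) / volume_size)

...and an "optimal" packing merely gets as close to that floor as the object sizes permit.)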

I generate large data sets (almost) daily. Up to this point, I have just been generating (gzip'ed) tarballs and split-ing them into "volume sized" pieces.

This works fine for getting them onto the media with the least amount of effort/intervention. I.e., this is an "ideal" packing (using the above terminology).

But, going back and trying to retrieve/inspect a portion of a data set is a real PITA. It's easy to find a particular *day*. But, with that set of media, the only reliable way of finding what I'm looking for is to mount all of the media, cat them together, pipe that through gunzip (or let tar do it for me), then de-tar, locate the file of interest and *extract* it.

It would be much nicer to just mount *a* medium, browse to that part of the filesystem and see if the file sought is present. If not, umount and move on to the next in the series.

(I think this is also more tolerant of media errors -- i.e., a bad read on volume N could make it impossible (?) to retrieve the contents of volumes > N, depending on the nature of the read error.)

Reply to
Don Y

I'd probably want to _add_ the requirement that if one of the media is destroyed or otherwise non-functional, either partially or completely, the rest can still be recovered and used.

This is reminding me of two areas. One is hibernation and sleep tapes -- we'd lose part of a tape by getting it tangled up, for example, and could splice it back into functionality, but would lose a file. The other is breaking up transmissions over the internet into smaller sections for broadcast, where compression was desperately wanted (we were using 2400 bps modems) but so was per-file recoverability: should one of the sections be lost, the rest could still be recovered (it often wasn't possible to ask for a repeated broadcast).

I don't have an answer off the top of my head. But one place I'd look is towards Dr. Knuth. He might be receptive to an email with this kind of question and have some thoughts about where to look or what to think more about. I'd consider writing him were I you. Just on a lark.

Jon

Reply to
Jon Kirwan

You could do that by treating the volumes as a RAID array. But, that increases the amount of data you preserve.

These aren't long term backups. Rather, I might want to look at something that I did a few days ago to compare against a new dataset.

If I can't find something ready made, I'll just brute force something. The advantage, there, is that I know how *my* datasets are organized so I can take advantage of that in the algorithms. E.g., if "1, 2, 3, 4, 5, 28" form a "best fit" for a volume, it would probably be "friendlier" (to me) to use "1, 2, 3, 4, 5, 6" and have a little extra waste.
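
Something as dumb as next-fit, taking the items in their "natural" order, would give me that bias. An untested sketch (assumes no single item exceeds the volume size):

    # fill volumes with items in order; start a new volume when the
    # next item won't fit.  Wastes a little space vs. best-fit, but
    # keeps runs of related items together on the same volume.
    def next_fit(items, volume_size):
        # items: list of (name, size) pairs, already in friendly order
        volumes, current, free = [], [], volume_size
        for name, size in items:
            if size > free and current:
                volumes.append(current)
                current, free = [], volume_size
            current.append(name)
            free -= size
        if current:
            volumes.append(current)
        return volumes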

Reply to
Don Y

This specific case is the "bin packing problem", which is NP-hard. According to Wikipedia, approximate algorithms exist which asymptotically use only ~20% more bins than an optimal solution.

Depending upon the typical file sizes, a brute-force (i.e. optimal) solution might not be out of the question. If most of the files are only a few blocks, that would drastically reduce the number of distinct partitions.
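
First-fit decreasing is the usual cheap heuristic -- an untested sketch, not anything tuned:

    # FFD: sort items largest-first, place each in the first volume
    # with room; provably within ~22% of optimal in the worst case
    def first_fit_decreasing(sizes, volume_size):
        free = []         # remaining space per volume
        packed = []       # item sizes assigned to each volume
        for size in sorted(sizes, reverse=True):
            for i, f in enumerate(free):
                if size <= f:
                    free[i] -= size
                    packed[i].append(size)
                    break
            else:
                free.append(volume_size - size)
                packed.append([size])
        return packed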

Reply to
Nobody

formatting link
formatting link

Reply to
nospam

This is not what they meant. For a *large* number of *random* items, the *average* result achieved by greedy algorithms would be some 20% worse than optimal. Iterative algorithms could get close to optimum, depending on how much processing time is available.
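
E.g., a toy improvement pass (untested sketch) that keeps trying to empty the least-full volume into the slack of the others, for as long as you care to let it run:

    # volumes: list of lists of item sizes.  Repeatedly move items out
    # of the emptiest volume; drop it when (if) it empties completely.
    def improve(volumes, volume_size, rounds=1000):
        for _ in range(rounds):
            if len(volumes) < 2:
                break
            volumes.sort(key=sum)          # least-full volume first
            src, moved = volumes[0], False
            for item in list(src):
                for dst in volumes[1:]:
                    if sum(dst) + item <= volume_size:
                        dst.append(item)
                        src.remove(item)
                        moved = True
                        break
            if not src:
                volumes.pop(0)             # one volume eliminated
            elif not moved:
                break                      # no move helps; stop early
        return volumes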

Reply to
Vladimir Vassilevsky

--snip--

Hi, Don.

Splitting up your problem and only attacking the portion I can claim any expertise on... why not just check the 1403 printout to see which of the 2400' tape reels contains your file?

More seriously, many of the desktop-PC backup programs I've used keep an index to help you figure out which diskette/tape cassette contains a particular file. Or, often more useful, remind you that you had deleted the file prior to that particular backup, and that you needed to be checking an earlier backup set.

If you don't mind using 'grep', this could be as simple as a text file with the backup date, file names, media volume IDs, and timestamps:

grep critical_file Backup-Index-bagheera-2011-06-01.log

yields:

/etc/sysadmin/critical_file.cfg 2011-05-15 02:41:12 Diskette 12

The actual _construction_ of this log file depends on your specific situation, and is therefore left as an exercise for the student (accompanied by the usual hand-waving and superior look).

Does this help?

Frank McKenney

--
  Until the millennium comes and countries cease trying to enslave
  others, it will be necessary to accept one's responsibilities
  and to be willing to make sacrifices for one's country -- as my
  comrades did. As the troops used to say, "If the country is good
  enough to live in, it's good enough to fight for." With privilege
  goes responsibility.
        -- E.B. Sledge / With the Old Breed at Peleliu and Okinawa
Reply to
Frnak McKenney

Don --

Do you really need to put this on some sort of "small" (no more than, say, 4.7GB) media? Our level 1 backup system here is a 1.5TB USB hard drive, on which we keep, simply as directories, full copies of the entire server: daily for the last two weeks and biweekly for the last two months. I think it cost $100, thirty lines of bash script, and two cron jobs.

Newegg's got on their front page today a 2TB internal drive for $60. Four of those would give you a 6TB RAID5 array; are you really generating so much data so fast that you would have to worry about running that out? Or is this an archival issue?

--
Rob Gaddi, Highland Technology
Email address is currently out of order
Reply to
Rob Gaddi

(sigh) Maybe *that's* the problem -- I'm using 3600' reels! :-/ (actually, I still have one 9-track)

Yes. But, that means I have to "backup" the data (i.e., using a "backup" program). Most of these mangle the data in the process in an attempt to compact or otherwise pack it in a way that is convenient for *them* to access.

I would just like "plain files" (bad for a "backup" because you waste, on average, half a sector per file -- making the "backup" larger than necessary).

On the UN*X machines, I use bacula for this. Then, I can just query the SQL database (that it maintains) directly. Though it still doesn't help me *access* the files once I know where they are...

(hmmm... I know of a surplus CD jukebox that holds a few thousand volumes. Maybe I should drag that home?? :> And, replace the CD drives with DVD-ROM... umpteen zettabyte data store!)

It gives me something else to think about for "options". E.g., maybe NFS export the datasets (on the PC) and NFS mount them on a bacula-enabled machine. (I could also try, again, to get bacula running on Windows :-/ )

Reply to
Don Y

I generate several GB (uncompressed) daily. Currently keeping the "live" datasets on a 1T drive (so I can examine "recent changes" easily) and moving "older stuff" onto optical media to keep space available on the drive (optical being dirt cheap with the exception of the manual handling involved).

I live in (irrational) fear of single spindle failures taking out *lots* of data. If I build a RAID array, then all is well until it decides it has to rebuild itself -- which leaves me twiddling my thumbs for the better part of a day. (I've also had issues with some RAID implementations where removing a drive and reinstalling it caused the array to complain).

My current approach seemed like the "80% solution" -- 80% of the benefit for 20% of the cost/effort.

I've also favored CD-ROM as an optical medium as I seem to have fewer "recovery errors" there than with the DVDs (I've tried +R and -R and can't seem to decide if either is "better"). But, with CDs, you want to compress so you don't end up burning CDs all day long...

Actually, thinking about your idea again -- only differently -- maybe I could use a *pair* of drives and alternate between them. I.e., odd days on drive 1, even on drive 2. So, a single spindle failure costs me *half* of my data and, more importantly, doesn't cost me "the most recent" (nor "the oldest") data (?)
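
Trivial to automate, too. Something like this (untested; the mount points are made up):

    import datetime

    # alternate target drives by day parity; a single spindle failure
    # then costs alternate days, not a contiguous block of them
    def target_drive(day=None):
        day = day or datetime.date.today()
        return "/mnt/arch1" if day.toordinal() % 2 else "/mnt/arch2"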

[to be honest, I've never really *lost* a spindle -- in terms of a hardware failure -- so my paranoia is partially unfounded. But, I've had an OS crash take out a spindle (and its *replacement*) and that experience has left me gun-shy]
Reply to
Don Y

RAID is not about preventing data loss, except for the data that is generated between backup runs - it is about minimising your downtime and effort if a disk is lost. So it makes sense to have RAID on your main drives - but to make sure your backups are independent of the RAID system.

The reality of data loss is that human error is the typical main cause, followed by software-related issues (OS goes bananas, software goes mad, viruses, malware, etc.). You can minimise those risks with good procedures, and good choice of OS and software. Power failures are normally more common than other types of hardware event, and can cause disruption and loss of data (especially with poor choices of OS, filesystem, and configuration), but are easily mitigated with a UPS.

None of these causes of data loss are helped by RAID. If you've got a nice mirror setup, and some twit deletes the wrong directory, it will be gone from both drives.

But RAID /does/ help against disk failure -- and disk failures /do/ occur. And when disks fail without RAID, you'll have a long job ahead of you to get everything restored and working again, even though you know your data is safe on the backups -- RAID is cheap insurance against this.

Picking the right RAID setup for the job is not always easy. But one thing you should always do is practice replacing drives /before/ you start to rely on the RAID set - then you know how to deal with emergencies when they happen. And make sure you know how your drives are numbered and how they relate to the physical drives - you do /not/ want to replace the wrong drive!

Reply to
David Brown

--snip--

No, you just need to build an easily-accessible "index" or "log" of some kind. Which, I admit, would require a shell script or its equivalent, if you wanted to capture the file names as you transferred them. If capturing the filenames "on the fly" -- or capturing the output of whatever command you use to transfer the files -- is impractical, you could post-process the tape/CD/DVD to build the index.

I admit I'm assuming that the index would be noticeably (usefully) smaller than the original data. If you're working with 1TB of (say) 8-byte files (Ack! Phlbbbbt!) this may not be the case.

How long does it take to generate a list of files on one of your media once the transfer (backup) is complete? Capture this, stuff it into a log file, add the volume ID and you have your searchable index. Yes? Or am I overlooking something?
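
A sketch of the sort of thing I mean (untested; the log name and the tab-separated layout are placeholders -- adapt to your situation):

    import os, time

    # walk a mounted volume, appending "path <tab> mtime <tab> volume"
    # lines to a master index that grep can search later
    def index_volume(mount_point, volume_id, log="Backup-Index.log"):
        with open(log, "a") as out:
            for root, _dirs, files in os.walk(mount_point):
                for name in files:
                    path = os.path.join(root, name)
                    mtime = time.strftime("%Y-%m-%d %H:%M:%S",
                                time.localtime(os.path.getmtime(path)))
                    out.write(f"{path}\t{mtime}\t{volume_id}\n")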

Sorry... the only part I was attempting to address was your concern over having to mount multiple media to locate a given file. I'm familiar with the name 'bacula' as a *NIX backup program, but not its details; I would have expected it to track volume ID in some way, but I've never tried using the program.

A local bank I did some work for a few years back used CD-RW jukeboxes. How long do you need to retain your data? A week? A month?

Hm. I don't know anything about your environment, but I hear that it's not too hard to run a MSWinXX virtual machine under *NIX these days, and I'm pretty sure you can run some flavors of *NIX within a VM hosted by MSWinXX. For some flavors of MSWinXX. Is this an option?

In any case, you have my sympathies.

Frank

--
  The tendency to achieve our foreign policy objectives by inducing
  other governments to sign up to professions of high moral and
  legal principle appears to have a great and enduring vitality in
  our diplomatic practice.
                -- George F. Kennan / American Diplomacy 1900-1950
Reply to
Frnak McKenney

Ah, I understand!

Yes, I (now) understand your point. It would be a no-brainer operation. Roughly "ls -alR > medium.log"

It keeps a *real* (SQL) database of the backup's contents.

Until I'm done with the project + a little. It's mainly so I can see how things are changing/have changed and, hopefully, sort out where I might have "made a wrong turn" (in the past).

For this reason, it's not "worth" lots of $$, time, space, etc. OTOH, *losing* some part that I might eventually need to reference would be "annoying", at the least (part of the dataset currently includes a source snapshot -- though I could put that somewhere else, at some additional maintenance cost)

I think the easier route is to export the filesystem and grab it from the UN*X boxen. There will be some bandwidth issues but at least it will be a clean "interface".

Running under a VM always seems fraught with potential for screwup. E.g., keeping track of what you are "talking (typing) to" and where this "display" is actually coming from...

(I already have a lot of hosts that I have to interact with during a typical day so it's a fair effort for me to remember what I'm doing while I'm doing it... e.g., "purple keyboard" means I'm talking to a Sun box...)

That and 25 cents will get me... ? :>

Reply to
Don Y

True. FWIW, in your place I might be happy to settle for something less than "true" optimality. There are other constraints: simplicity (assumed to be related to robustness), time to calculate optimality, how often I might need to mount multiple volumes to retrieve multiple files, and (more hand-waving) "so forth". Oh, and how much faith I have in the assumption that a single file will always fit on one volume.

Depending on the number of volumes you need for a typical backup and their cost, you might be able to settle for "no more than 10% free space". Have you (in your copious free time, I realize) tried simulating the "losses" of a simple brute-force tree walk vs. "something" more sophisticated? It's not just volume wastage you need to worry about; there's also how much of your time you spend trying alternatives out and posting to those time-sucking USENET newsgroups...

Although, in all fairness, I think you did start off asking whether someone knew of an existing package.

[...]

With the volume ID in there somewhere. Right. And ideally captured "on the fly" so you don't have to re-mount all twenty 3600' 800BPI tape reels.

(I'm kidding, and I didn't actually do the math. But I _do_ remember being shocked at the effect of big tape blocks on how much you could put on a reel. Those 5/8"(?) inter-block gaps really added up if you stored card images as unblocked 80-byte blocks vs. 32000-byte blocks.)

[...]

I think I recall a figure of 1TB being mentioned by someone (it might even have been you). That's 223 4.5GB DVDs... or, in technical terms, "a bunch". I suppose having a farm of 223 DVD drives would speed things up a little, if one could ignore the possible failure rate.

Did you already say what you were currently using?

You might want to look into (say) 1 Tb hard drives, but mounted in removable caddies like these if you haven't already:

formatting link
formatting link

Backing up would be a _lot_ faster than it would be to DVDs.

Yup. And 'costs' include your time to set this up and use it regularly(!).

[...]

Agreed. Complexity adds costs in 'distraction time'; using known or already-in-use tools reduces these.

My current keyboard, mouse, and LCD are attached to an IOmega KVM-blob that lets me switch between my office desktop (Dell 450MHz GX1 running OS/2) and my laptop (Dell D630 2400(?)MHz running Ubuntu 10.04). Fortunately there are enough visual differences in the two "desktops" so I don't lose track of which one I'm connected to... but I still occasionally find myself trying to jump from the OS/2 desktop into a laptop app. (Added to List Of Things I Need To Fix About The Universe as #8,730.)

How about a cuppa tea fresh off the "Mr. Tea" machine?

(I keep meaning to cover up the Mr. Coffee logo with a color picture of B.A. Baracus, but I don't think any of my nephews or nieces would recognize him.)

Good luck.

Frank

--
  We enjoy the foibles of the great, their follies, and their
  self-proclamations, as it titillates both our own grandiose folly
  and our feeling of self-importance -- as we feel ourselves,
  rightly, superior to them.  But this thrill is cheap and it is as
  nothing compared to our enjoyment of real heroism.  Why? Because
  when we see real heroism, the heroism of the ordinary person
  forced by circumstances to act bravely, we identify with that man
  or woman and we say, "If they can do it, then perhaps I could.
  too."            -- David Mamet / True and False
Reply to
Frnak McKenney

The latter is not a problem, currently -- though I realize for a "general" solution it would be!

I was assuming I could just point a tool at a directory and let it sort out what should go where -- building a list of files for each volume.
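
Roughly this -- an untested end-to-end sketch (the 2KB block size is an assumption; use the target medium's real allocation unit):

    import os

    # walk the tree in sorted order (keeps siblings adjacent), cost
    # each file in whole blocks, cut a new volume when one won't fit
    def plan(tree, volume_size, block=2048):
        volumes, current, free = [], [], volume_size
        for root, dirs, files in os.walk(tree):
            dirs.sort()
            for name in sorted(files):
                path = os.path.join(root, name)
                cost = max((os.path.getsize(path) + block - 1)
                           // block, 1) * block
                if cost > free and current:
                    volumes.append(current)
                    current, free = [], volume_size
                current.append(path)
                free -= cost
        if current:
            volumes.append(current)
        return volumes    # one list of file paths per volume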

A solution once is a solution *always* :>

If I had thought of this "in the beginning", I wouldn't have to deal with it *now* :-/

Naw, you buy the media in bulk... comes on the back of a semi. Record however much you need and then cut it into reel-sized lengths. Hire the neighbor kids to wind the stuff onto the blank reel hubs... (keeps crime rate down as no one has time to get into trouble if they're busy winding tape!!)

I wrote a Pertec driver that would let me mount tape *filesystems* (R/O). It was cool to watch the transport try to decide whether to read forward or reverse based on where it was at the time. Kinda "therapeutic"... like watching a pen plotter (or watching a *fishtank* for those without mechanized toys as distractions!)

I have a 1T drive that I keep "recent" images on. From there, I move older images onto CDs (cat | tar | gzip | split). This is relatively painless to *do* because it's "unconditional". OTOH, retrieving something means mounting all the CDs, gluing them together, decompressing, etc. just so I can "see" the entire archive.

Yes, I'm just gunshy of single spindle failures.

Of course, my current approach leaves the most valuable (?) datasets in that "single spindle" situation (since *oldest* datasets get moved onto "other media").

Dunno. I suspect I will just cross fingers and hope I'm done before the fan gets hit...

But, once in place, I could use it for other needs as well.

I've had bad luck with KVM's. They either latch up (if a machine is powered down at the time, etc.) or are too PC-centric (e.g., won't work with a Sun box) or are too *expensive* (a "Rose" by any other name :>)

My work area is arranged as a large "U". I have three "stations" set up -- one on each "edge" of the U. Each is a pair of 21" monitors connected to two "dual headed" machines. I.e., there are two machines for each workstation (~6 total) with each machine having two monitors attached to it. So, (calling the machines A and B and their video outputs 1 and 2) the monitors either display A1+A2, B1+B2, A1+B2 or B1+A2.

There are two keyboards (and mice) in each workstation -- one atop the table and one in a drawer beneath.

So, I have to figure out what I'm *looking* at (i.e., if I am looking at the left monitor, then I know this is either A1 or B1 that I am seeing). Then, I have to figure out what keyboard/mouse interacts with that display. *Then* I can do whatever I was planning on doing! :-/

But, it gets worse! :>

E.g., I can be in an X server running on a Windows box (for example, "A"). In that X session, I could be connected to a UN*X box (to make things interesting, *B*!). So, even though what I see in the X server tells me I am talking to (e.g.) a Sun box, I have to remember to type on the *PC* keyboard and *not* the Sun keyboard.

(it gets even more amusing when you start adding telnet/ssh sessions daisy chained through multiple hosts! Type "exit" and you never know *which* machine's prompt you'll be seeing!)

I have one in front of me, as I type this! (I drink tea, almost exclusively -- 100 tea bags in 2 weeks)

Ha! The A-Team Movie has an homage to the original series (when they break Murdoch out of the wacky ward)

Thx! Time for a refill...

Reply to
Don Y

Ha! This looks like it might be interesting! At least, perhaps, a framework that I can bend to apply *my* criteria (which differ from those stated in the "summary" for the project).

Not enough information posted on this one to determine the criteria it uses :<

Thx!

Reply to
Don Y
