disable journalling?

I would like to disable journalling on a Pi that is running as a server, with reliable power. I have read that journalling writes data to the same place all the time, and thus causes the SD card to fail at some point. Turning off journalling should improve that. (I know the risks.)

It is easy to turn off journalling on ext4, but the directions for doing that always mention that the filesystem should first be unmounted, then the flag can be toggled, then the filesystem should be checked and thereafter it can be mounted again.

This is a bit tricky when one wants to set the option for the root filesystem, and especially on a (remotely located) Pi. Is it possible to do this in any way?

Reply to
Rob

Plug the SD card into a USB card reader; plug that into another linux box, and do your magic.
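Something along these lines, assuming the card's root partition shows up as /dev/sdb2 on that box (the device name is only an example; check what your reader actually presents):

  # with the partition NOT mounted:
  sudo tune2fs -O ^has_journal /dev/sdb2   # clear the has_journal feature flag
  sudo e2fsck -f /dev/sdb2                 # forced check afterwards, as the directions recommend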

Reply to
Tony van der Hoff

Dude.

Reply to
A. Dumas

What about system crashes?

The firmware in the card prevents the data being written to the same place, so you only need to be concerned with the total write capacity of the device.

There will be extra wear on the device with journalling and also some delays.

My choice would be to double the size of the memory device if possible and leave journalling on. Avoiding one filesystem failure, due to a system crash or a failure of the "reliable" power, over the life of a dozen or so devices will pay for doubling all of the storage.

Reply to
Mark F

This would mean asking the remote location to stop the Pi, remove the card and mail it to me; I could then make the change, mail the card back, and ask for it to be put back in.

Several days of downtime. There should be a better way.

Reply to
Rob

They have not happened yet. I have been running Linux since 1992, used ext2 for many years, and never had any problem due to the (infrequent) system crashes. I think it is not an important issue.

Unfortunately the scarce sources of information are not definitive on this. Some state that this only happens on the SATA SSDs used in PCs as system-disk replacements, and not on SD cards.

Also, the "fstrim" command, which reported OK results on previous kernels, no longer works since the Jan 3 #622 kernel. It is unclear to me whether it never worked and just returned fake results, or whether it worked but no longer does.

I also see claims that "the reason SD cards fail so quickly in the Pi is the journalling" (but I have not yet had an SD card fail).

My most important Pi uses only about 15% of the space on the card. However, it is unclear to me whether the card "knows" that, given that fstrim apparently no longer works, and of course the card was written with an image at install time (so the card may consider all blocks to be in use).
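For reference, the command in question is just fstrim run against the mounted filesystem; on the kernels where it still worked it printed how much it had trimmed:

  sudo fstrim -v /    # -v reports the number of bytes trimmed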

Reply to
Rob

Hm, it's usually best to state all the constraints at the outset.

Plan B: burn a new SD card, configured to your requirements, mail it to the remote location, get them to swap it, and mail the original back. Two minutes of downtime.

I can't think of a better way.
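A rough sketch of that, assuming the new card appears as /dev/sdc on the machine preparing it (the image file name and device are only examples):

  sudo dd if=raspbian.img of=/dev/sdc bs=4M conv=fsync   # write the image to the new card
  # re-plug the reader (or re-read the partition table) so /dev/sdc2 appears, then:
  sudo tune2fs -O ^has_journal /dev/sdc2                 # drop the journal on the root partition
  sudo e2fsck -f /dev/sdc2                               # check it before it goes in the post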

Reply to
Tony van der Hoff

Well, I don't rule out the possibility of turning off the journalling (maybe not removing the journal area, but at least stopping the writes to it) with a trick that only involves changing mount options or modifying the feature flags in the filesystem, plus possibly a reboot.
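As a half-measure, the journal traffic can at least be reduced from /etc/fstab alone; something like this is what I have in mind (just a sketch, the commit interval is an arbitrary choice and the device is the usual Pi root partition):

  # /etc/fstab - root entry with fewer metadata writes and batched journal commits
  /dev/mmcblk0p2  /  ext4  defaults,noatime,commit=600  0  1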

Reply to
Rob

I'm thinking you should be able to use initramfs to load the tools to do this, but I've no idea how.
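Or, if tune2fs will accept a filesystem that is merely mounted read-only (I believe it only refuses a read-write mount), something like this over ssh might do it, assuming you can first stop everything that writes to / (syslog, cron and friends):

  sudo mount -o remount,ro /                    # may fail while anything still has a file open for writing
  sudo tune2fs -O ^has_journal /dev/mmcblk0p2   # clear the journal feature
  sudo e2fsck -f /dev/mmcblk0p2                 # same situation as the boot-time check of /
  sudo reboot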

Reply to
Rob Morley

Well, AFAIK nothing apart from flash devices uses wear-levelling algorithms, so you won't find them in the drivers of any operating system. It follows that the wear levelling must be in the SD card's internal controller - and it's the wear-levelling algorithms that make sure that write hotspots don't stay in one spot on the card.

I agree that using a bigger card is a good idea. Look at it this way: wear levelling isn't going to mess with blocks that contain (relatively) static data, so wear levelling will only cause logical blocks to migrate round the free space on the card.

In the light of that, I recently decided that my 4GB card was a bit small, since the 'stuff' on the ext4 partition occupies 2.7 GB, so I've migrated my Raspbian setup onto an 8GB card.

That was simple:

- connect two SD readers to a bigger Fedora box

- use cfdisk to create partitions on the new card

- use mke2fs and mkfs.msdos to format the partitions

- use dd to shift all the data across.

At that point, although all the data had been copied correctly, mounting the new SD card on the Fedora box showed that the new ext4 filesystem still thought it was the same size as it had been on the 4GB card. However, resize2fs soon fixed that.
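In command terms that last step was roughly the following (the partition names are just what the two readers happened to be assigned on my box; yours will differ):

  sudo dd if=/dev/sdb2 of=/dev/sdc2 bs=4M conv=fsync   # copy the old ext4 partition across
  sudo e2fsck -f /dev/sdc2                             # resize2fs insists on a clean check first
  sudo resize2fs /dev/sdc2                             # grow the fs to fill the new partition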

Moved the card onto my RPi. It booted immediately and the free space is now up from around 1.3GB to 4.3GB.

One oddity, though: some of the lower-level views of the Raspbian partitions seem to show that it's using a 4KB block size. Might this be an SD card optimisation? I believe that the SD card's internal block size (the one used for wear levelling) is 4KB - it just seems like a somewhat oversized block for a Linux filing system, though equating it to the wear-levelling unit would make a lot of sense.

--
martin@   | Martin Gregorie 
gregorie. | Essex, UK 
org       |
Reply to
Martin Gregorie

This is true.

However, apart from blocks that haven't been written to at all, the controller has no idea of "free space". Once a block has been written, it assumes there is data on it, even if the file has since been deleted from the file system. This is where TRIM comes into play: it can tell the SD card which blocks are currently in use and which really are free.

Two things identify the filesystem size. One is the partition table which identifies how much space has been reserved for the file system on the storage medium. The other is the filesystem superblock.

BTW you don't need to use mkfs to create filesystems on the new card if you are using dd to copy an existing fs onto it. dd will overwrite all the formatting info that mkfs.{vfat|msdos} creates.

4KB is the normal blocksize for ext[34] filesystems. All my Debian based PCs have 4KB block sizes on their hard drives.
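You can read it straight from the superblock if you want to confirm it, e.g. (the device name is only an example):

  sudo tune2fs -l /dev/mmcblk0p2 | grep 'Block size'
  # typically prints:  Block size:               4096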
Reply to
Dom

Yes, but I don't think the card will juggle around any already-written data, e.g. swap a block that is relatively static with a block that has been written many times, so as to relieve the heavily written block of further updates.

The card will only move written data onto blocks that have never been written before. Once you dd a full-size image onto the card, you have written every block and removed them all from the pool of blocks usable for wear levelling, even the ones that are unused in the filesystem.

It may be different when the filesystem is later resized to the full card size, as that is likely to add blocks to the available area without actually writing to them.

Reply to
Rob

Remember that wear levelling isn't the only thing a flash controller has to do. It must also erase dead pages before they can be programmed again, and it can only do this one erase block at a time. Therefore any live pages in the same erase block will be migrated to an unwritten page in a different erase block, even in the absence of any host write operations affecting those pages.

--
http://www.greenend.org.uk/rjk/
Reply to
Richard Kettlewell

I don't think you can assume that anything in the above 2 paragraphs is true.

I think keeping the file system the same size when you increase the size of the media will give the device more space to work with, since it knows that only the logical (user-visible) blocks inside the partition can be in use.

Reply to
Mark F

It is difficult to get factual data on what is really happening, but my understanding so far has been that it works as described above at least on SSD devices, and the special command TRIM is used to tell the device that a block has been abandoned (i.e. the file that once was using it has been deleted) so it can use it again for wear levelling.

But I have also read that SD cards are different from SSD devices, so it may work differently there. Until recently, one could use the "fstrim" command on the Pi and it would report how many bytes it had trimmed (the blocks having been freed on the filesystem lately), but starting from the current #622 kernel this command is no longer allowed.

I still have not found out whether this is a bug, whether the command had been reporting success in the past while actually doing nothing, or whether there is another explanation.

Sure, that would be a good way to reserve blocks that you are sure are never being touched by the filesystem.

Coming back to the original topic, I still would like to know if it is possible to turn off journalling. It seems wasteful to write everything in two steps on a device that has limited write endurance. I can understand that it is desirable when the device is in the hands of kids who may yank the power plug at any time. But my system is always on and reliably powered and cabled, and I reboot it only via the reboot command. Operating without a journal should be as safe as it was before journalling was added to Linux (including running a filesystem check when the system crashes).
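For checking afterwards: whether the journal really is gone should show up in the superblock feature list, e.g.

  sudo tune2fs -l /dev/mmcblk0p2 | grep -i features
  # "has_journal" should no longer appear once the journal has been removed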

Reply to
Rob

OK, so resize2fs wrote the new FS size to all superblocks in the partition. I was wondering what it had updated. Obvious with hindsight.

I knew that dd would only overwrite just under 4 GB of the new partition, but not what it would make of the MS filesystem blocks occupying the rest of it, so I thought that reformatting before using dd would be sensible.

Thanks - I had a scan round but didn't find anything definite. Is 4KB an actual block size, or is it really a cluster of four 1KB blocks? I picked up hints that ext4 tends to work with clusters of blocks as its read/write unit and that it might favour 4KB clusters that contain 4 blocks.

Presumably an ext[34] fs can't put data from more than one file in a block, so, as many Linux files are smaller than 4KB, using 4KB blocks would seem to be fairly wasteful of disk space.

--
martin@   | Martin Gregorie 
gregorie. | Essex, UK 
org       |
Reply to
Martin Gregorie

It still hasn't caught up with ReiserFS...

However, it is not as important as it used to be, except maybe for a system that stores mail in maildir format. Today, the average file size is a lot larger than it was 20 years ago.

Reply to
Rob

As the filesystem doesn't know about the additional blocks (until you run resize2fs), it doesn't care what is in them. Running resize2fs will update and create new superblocks as needed. It won't touch any blocks that are then defined as free space as there is no data to write to them yet.

I believe they are actual 4KB blocks. They seem to be arranged in larger groups, so a new file may take a number of blocks, but when space starts to get limited those extra blocks will be reallocated to other files to make the most of the available space. Initially allocating a number of blocks to a file helps to prevent fragmentation - as long as there is sufficient free space.

It's only a small amount of wastage, and the larger hard disks currently available use 4KB sectors anyway, so using smaller block sizes would be slow - to write a 512-byte logical sector would mean reading 4KB, updating 1/8 of its content and writing it out again, instead of just writing the 4KB in the first place.

When you do a dumpe2fs to check the file system status, compare the Inode Count to the Block Count. I think you'll find that it's at least 2 blocks per inode, so that is a minimum of 2 blocks per file (more if you allow for linked files).
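For example (the device name is only an example; -h prints just the superblock summary):

  sudo dumpe2fs -h /dev/mmcblk0p2 | egrep 'Inode count|Block count|Block size'
  # compare the two counts and you get the blocks-per-inode ratio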

Reply to
Dom

More than I wish to remember, but IIRC all FAT file systems used 512 byte blocks. The problem was always the size of the FAT table, which had a fixed size and in which each entry corresponded to a disk block (FAT8). Eventually disks outgrew this limit and the quick fix was to say that each FAT entry referenced a cluster of 8 contiguous blocks (FAT12 and FAT16). After that the cluster size increased still further.

As you said, a bloody nightmare - only two copies of the FAT and AFAIK nothing in the blocks to link them back to the FAT, which in turn had no detectable integrity-checking features, so, as we all know full well, it was a very fragile structure. And don't even mention fragmentation and the fun & games of finding a reliable defragger. Hint: that wasn't the M$ one - I've seen *that* wonderful bit of software totally destroy a disk's contents more than once.

--
martin@   | Martin Gregorie 
gregorie. | Essex, UK 
org       |
Reply to
Martin Gregorie

Noted.

OK

Just for fun I wrote a script that pipes the output of "ls -lR $1/*" into a small AWK program that reports the number of files it finds along with max, min and average file sizes. Rather to my surprise, it shows the average file size to be around 5-5.5 KB, so quite a lot bigger than I expected to find.
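The script is nothing fancy - roughly along these lines (a sketch from memory; it only counts regular files and will divide by zero if it finds none):

  #!/bin/bash
  # report file count plus max, min and average size under the given directory
  ls -lR "$1"/* | awk '
      /^-/ { n++; s += $5                       # regular files only; size is field 5
             if ($5 > max) max = $5
             if (min == 0 || $5 < min) min = $5 }
      END  { printf "files: %d  max: %d  min: %d  avg: %.1f\n", n, max, min, s/n }'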

--
martin@   | Martin Gregorie 
gregorie. | Essex, UK 
org       |
Reply to
Martin Gregorie
