Continuous disk write performance in ported real-time application

Hi all,

We are working on porting a real-time application from VxWorks to Linux as a sort of experiment, to see if Linux fits our needs for this application. We are using the server edition of Ubuntu 9.04 running on a PC/104 single-board computer with a SanDisk Extreme III 16 GB CompactFlash card (rated at 30 MB/s) for storage. So far the port has gone quite well. We found a library called v2lin on SourceForge that allowed us to port the application with minimal changes to the actual code. For simplicity, we have ported the code as a single-process, multi-threaded application, similar to how it ran in VxWorks.

One of the jobs of the application is to record the images it is processing to disk for post-analysis/processing. The system processes these 640x480 images (300 kB each) at 10 Hz, giving a total of about 3 MB/s that we want to write to disk. With the disk rated at 30 MB/s and our tests showing sustained throughput of at least 10-15 MB/s, it seems the system should be able to maintain a 3 MB/s stream of data to the disk, even with file system overhead and all. In the current implementation, we get an image, write it to disk, then wait for the next image. So if it takes longer than 100 ms to write an image, we will miss the next image(s).

What we are seeing is that in most cases, it takes 1-3 ms to "write" each image file. Since this is obviously much faster than 300 kB of data can actually be written, it is clear Linux is using RAM to cache the file contents before actually writing them to the CF disk. However, sometimes it will take much longer to complete the write call, from a few hundred ms to a few seconds. Using tools like vmstat, iostat, and /proc/meminfo, we have been able to see that Linux is writing out the buffered data to disk in spurts. When it starts to get "behind," it will block the application and wait for some data to be written out to disk before continuing.

What we would like to do is to even out those spurts of actual disk writing to be more consistent. The application takes a relatively small portion of the 100 ms per image to process, so it should have quite a bit of time to write out an image to disk. But it seems to just sit there at times, content to let the disk cache build up. Does anyone know where we might look to tune something or change some setting that would get us more consistent disk writing? Thanks!

Jonathan

Reply to
Jonathan

Have you experimented with the different I/O schedulers?

You may need to experiment, but I think the deadline I/O scheduler might be what you need.
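On a 2.6 kernel the scheduler can be checked and switched per block device at runtime through sysfs. A minimal sketch (the device name sda is an assumption; check which node the CF card shows up as):

```shell
# Show the active scheduler for the CF card's block device
# (device name "sda" is an assumption -- check dmesg for the real one):
cat /sys/block/sda/queue/scheduler
# prints something like: noop anticipatory deadline [cfq]

# Switch to the deadline scheduler (as root); takes effect immediately:
echo deadline > /sys/block/sda/queue/scheduler
```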

-- Kristof

Reply to
Kristof Provost
[...300KB images at 10 Hz...]

As you are writing much more than the cache on your embedded device can hold, why use the cache at all? If it were really small chunks you would gain something from write combining, but in your case I don't see any benefit at all.

So I would use open(..., O_SYNC), create another thread in your application, pass images to it via some sort of queue and let it do the writing. You need the queue because the CF disk itself can sometimes stall to do some housekeeping, but otherwise you get a constant write rate.
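A minimal sketch of that suggestion: a dedicated writer thread drains a bounded queue and writes each image with O_SYNC. The names (image_t, QUEUE_SLOTS, the image path) are illustrative, not from the thread:

```c
/* Sketch: writer thread fed by a bounded queue, writing with O_SYNC so
 * each write() goes to the CF card instead of piling up in the page cache.
 * image_t, QUEUE_SLOTS and the path field are illustrative names. */
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define IMG_BYTES (300 * 1024)   /* one 640x480 image, ~300 kB */
#define QUEUE_SLOTS 32           /* absorbs CF housekeeping pauses */

typedef struct {
    unsigned char data[IMG_BYTES];
    char path[64];
} image_t;

static image_t queue[QUEUE_SLOTS];
static int q_head, q_tail, q_count;
static pthread_mutex_t q_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t q_nonempty = PTHREAD_COND_INITIALIZER;
static pthread_cond_t q_nonfull = PTHREAD_COND_INITIALIZER;

/* Called from the capture thread: blocks only if the queue is full. */
void queue_push(const image_t *img)
{
    pthread_mutex_lock(&q_lock);
    while (q_count == QUEUE_SLOTS)
        pthread_cond_wait(&q_nonfull, &q_lock);
    queue[q_tail] = *img;
    q_tail = (q_tail + 1) % QUEUE_SLOTS;
    q_count++;
    pthread_cond_signal(&q_nonempty);
    pthread_mutex_unlock(&q_lock);
}

/* Writer thread: drain the queue, one synchronous write per image. */
void *writer_thread(void *arg)
{
    (void)arg;
    for (;;) {
        image_t img;
        pthread_mutex_lock(&q_lock);
        while (q_count == 0)
            pthread_cond_wait(&q_nonempty, &q_lock);
        img = queue[q_head];
        q_head = (q_head + 1) % QUEUE_SLOTS;
        q_count--;
        pthread_cond_signal(&q_nonfull);
        pthread_mutex_unlock(&q_lock);

        int fd = open(img.path, O_WRONLY | O_CREAT | O_SYNC, 0644);
        if (fd < 0) { perror("open"); continue; }
        if (write(fd, img.data, sizeof img.data) != (ssize_t)sizeof img.data)
            perror("write");
        close(fd);
    }
    return NULL;
}
```

The bounded queue gives the capture side back-pressure instead of unbounded memory growth when the card stalls.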

Vitus

--
Vitus Jensen, Hannover, Germany, Earth, Universe (current)
Reply to
Vitus Jensen


We have played with the I/O scheduler some, and it doesn't seem to make a difference. I think the I/O scheduler matters more when multiple processes are writing. In our case, there is really only one process writing to the disk.

Jonathan

Reply to
Jonathan


Because the cache is useful for buffering and smoothing out the writing. Like you said, the CF may not always allow a constant write rate, so there needs to be buffering somewhere, and letting the OS do it is "free" in terms of development. We have 1 GB of RAM, of which our application uses very little, so while the cache cannot hold an hour's worth of video, it can hold quite a bit. We have tried using O_SYNC, and with it, writing a file takes longer than 100 ms every time (even though 100 ms seems like plenty of time to write 300 kB out to disk). We have not experimented with changing the architecture of the program yet (like adding in-application buffering). There already is a separate thread doing the writing to disk. Thanks for the reply.

Jonathan

Reply to
Jonathan

You should divide image recording and writing the data to disk into separate threads. The image recording thread runs under the real-time scheduler (to meet your requirements), the writing thread under the regular scheduler. Then you can use all the usual hints to force Linux to write the data to disk without disturbing the recording thread.
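A sketch of that split, assuming pthreads: the capture thread gets SCHED_FIFO so it keeps its 10 Hz deadline, while the writer stays on the default scheduler. The function names record_loop/write_loop are stand-ins for the real thread bodies, not from the thread:

```c
/* Sketch: real-time capture thread plus ordinary writer thread.
 * record_loop/write_loop are placeholder stubs for the real bodies. */
#include <pthread.h>
#include <sched.h>

static void *record_loop(void *arg) { (void)arg; return NULL; } /* stub */
static void *write_loop(void *arg)  { (void)arg; return NULL; } /* stub */

int start_threads(void)
{
    pthread_t rec, wr;
    pthread_attr_t attr;
    struct sched_param sp = { .sched_priority = 50 };

    /* Capture thread: explicit SCHED_FIFO attributes.
     * Setting a real-time policy needs root (or CAP_SYS_NICE). */
    pthread_attr_init(&attr);
    pthread_attr_setinheritsched(&attr, PTHREAD_EXPLICIT_SCHED);
    pthread_attr_setschedpolicy(&attr, SCHED_FIFO);
    pthread_attr_setschedparam(&attr, &sp);
    if (pthread_create(&rec, &attr, record_loop, NULL) != 0) {
        /* EPERM without privileges: fall back to default scheduling */
        if (pthread_create(&rec, NULL, record_loop, NULL) != 0)
            return -1;
    }

    /* Writer thread: plain SCHED_OTHER, default attributes. */
    if (pthread_create(&wr, NULL, write_loop, NULL) != 0)
        return -1;
    pthread_join(rec, NULL);
    pthread_join(wr, NULL);
    return 0;
}
```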

jbe

Reply to
Juergen Beisert

I see. So O_SYNC might write every filesystem allocation unit directly to disk, or something similar. BTW: which filesystem are you using?

Now you have me confused; I thought it was a

for(;;) { get_image(); write(); }

loop?

Anyway, you asked about tuning the cache behaviour. Look at /proc/sys/vm, documented in Documentation/sysctl/vm.txt. Decreasing the "dirty" limits might help you.
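A sketch of what that tuning could look like; the values are illustrative starting points, not recommendations from the thread, and the exact knobs available depend on the kernel version:

```shell
# Lower the dirty-page thresholds so the kernel flushes sooner and in
# smaller spurts (values are illustrative; see Documentation/sysctl/vm.txt):
sysctl -w vm.dirty_background_ratio=1       # start background writeback at 1% of RAM
sysctl -w vm.dirty_ratio=5                  # block writers only past 5% of RAM
sysctl -w vm.dirty_expire_centisecs=500     # consider dirty data old after 5 s
sysctl -w vm.dirty_writeback_centisecs=100  # wake the flusher every 1 s
```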

I still think this is not the best approach: depending on the kernel version's behaviour, if you hit some limit during write(), the VM might decide to do the flush inside your write() and lengthen the syscall. Use a separate thread and some local queue, with fsync()s, so you don't compete with the cache for memory (since O_SYNC doesn't work for you).

Vitus

PS: 1GB of RAM and "embedded" :-O Where did those 128K devices go?

--
Vitus Jensen, Hannover, Germany, Earth, Universe (current)
Reply to
Vitus Jensen

We're using ext3. In some reading, I came across a statement suggesting that fsync() on ext3 can cause more than just the file contents to be written to disk because of the journaling. Maybe it's something similar with O_SYNC.

No, it's actually multiple threads. Currently images are coming in over RPC and get submitted to a processing thread and an image recording thread.

Yeah, we've been looking at and playing with those limits. We can change the behavior of writing, but would still get dropped images.

Yeah, I've come to agree with you. There's really no reason to hold onto this hard real-time requirement, especially on a non-real-time OS. I added a linked queue between the receiving and recording threads, and that seems to work well. When a write takes too long, images get backlogged in the queue; when it finishes, the queued images get written to disk fairly quickly. This cooperates with the disk cache well. Running "vmstat 1" shows 15 MB/s writes (over 1 second) every 5 seconds, which is about right.
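For reference, a minimal sketch of such a linked queue, assuming pthreads (the names enqueue/dequeue and the node layout are illustrative): unbounded, so the receiving thread never blocks; a slow write just grows the backlog, which the recording thread drains once the disk catches up.

```c
/* Sketch: unbounded linked queue between receiver and recorder threads. */
#include <pthread.h>
#include <stdlib.h>

struct node { void *image; struct node *next; };

static struct node *head, *tail;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t nonempty = PTHREAD_COND_INITIALIZER;

/* Called by the receiving thread: never blocks, backlog just grows. */
void enqueue(void *image)
{
    struct node *n = malloc(sizeof *n);
    n->image = image;
    n->next = NULL;
    pthread_mutex_lock(&lock);
    if (tail) tail->next = n; else head = n;
    tail = n;
    pthread_cond_signal(&nonempty);
    pthread_mutex_unlock(&lock);
}

/* Called by the recording thread: sleeps until an image is available. */
void *dequeue(void)
{
    pthread_mutex_lock(&lock);
    while (!head)
        pthread_cond_wait(&nonempty, &lock);
    struct node *n = head;
    head = n->next;
    if (!head) tail = NULL;
    pthread_mutex_unlock(&lock);
    void *image = n->image;
    free(n);
    return image;
}
```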

Thanks for the help. I kind of hated to "punt" but it's the right thing to do. It shows we can keep up, so in theory it seems like it should be doable the other way if we could just get the OS to cooperate.

Yeah, in this case "embedded" pretty much means "small." :) The SBC has a 1.1 GHz Pentium-M and an SO-DIMM slot on it. A 1 GB DDR SO-DIMM is cheap :) I booted a USB stick of Fedora 11 on the thing and brought up Firefox no problem. It's probably comparable to my ASUS Eee PC netbook in terms of processing power....

Jonathan

Reply to
Jonathan

Linux is not really that good with threads. A better way would be to have separate processes.

Reply to
Vladimir Jovic

Since when?

And then you really get to play with shared memory, process synchronization and so on. Uhhhh...

jbe

Reply to
Juergen Beisert

[cut]

My experience with Linux threads is limited, so the problem with threads is most likely on my side ;)

Exactly. I refuse to play with mutexes, semaphores, thread synchronisation, shared data access, etc.

Read this to see why I made that suggestion:

formatting link

Reply to
Vladimir Jovic


It should not be too hard to invent your own, but don't most real-time frameworks give you shared-memory FIFO mechanisms to solve this problem?
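One such ready-made mechanism on Linux is a POSIX message queue, which works across processes without hand-rolled shared-memory locking. A minimal sketch (the queue name and sizes are illustrative; older glibc needs -lrt):

```c
/* Sketch: cross-process FIFO via a POSIX message queue. */
#include <fcntl.h>
#include <mqueue.h>
#include <string.h>

#define QNAME "/imgq"   /* illustrative queue name */

/* Producer side: create the queue if needed and post one message. */
int send_msg(const char *msg)
{
    struct mq_attr attr = { .mq_maxmsg = 8, .mq_msgsize = 128 };
    mqd_t q = mq_open(QNAME, O_CREAT | O_WRONLY, 0644, &attr);
    if (q == (mqd_t)-1) return -1;
    int rc = mq_send(q, msg, strlen(msg) + 1, 0);
    mq_close(q);
    return rc;
}

/* Consumer side: block until a message arrives.
 * buf must be at least mq_msgsize (128) bytes. */
int recv_msg(char *buf, size_t len)
{
    mqd_t q = mq_open(QNAME, O_RDONLY);
    if (q == (mqd_t)-1) return -1;
    ssize_t n = mq_receive(q, buf, len, NULL);
    mq_close(q);
    return n < 0 ? -1 : 0;
}
```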

Reply to
cs_posting

Okay. Threads are not the answer to every question. Sometimes they are evil, sometimes I'm happy they exist and make my work so easy. It depends....

You're missing a chance to improve your knowledge ;-)

It's okay to write regular programs in this manner. But don't do so if you are writing a program that has to meet strict timing specs. How do you want to predict the behaviour of your program if it can block when it tries to write some data into a pipe? Or into a file on disk? Or to the screen? How long will this pause take? You can't predict it. So you can't write programs with strict timing requirements in this way.

jbe

Reply to
Juergen Beisert

Sure. But if you decouple the real-time part from the non-real-time part in such a way, it doesn't matter whether you use two threads or two processes to do the work. Hmmm, but switching between two threads does not need a full context switch, so they might be faster on embedded systems with low computing power (and no cache flush, as on the ARM architecture for example).

jbe

Reply to
Juergen Beisert

Why ?

-Michael

Reply to
Michael Schnell

I read somewhere (a long time ago, I no longer have any idea where) that context switching in Linux is almost instant, with negligible performance loss. Therefore your remark about context switching doesn't hold.

Reply to
Vladimir Jovic
[...]

Before writing to a pipe, use select() with a timeout to see if writing is possible. If the pipe is full for some short time, the process can either sleep for a while and retry, or it can report an error. It doesn't have to block.

Like threads, processes can have RT priority as well.

BTW, in the same situations where a process would block, your thread would block as well. If that happens, I would call it bad design, or a bug.

Reply to
Vladimir Jovic

Okay.

jbe

Reply to
Juergen Beisert

Okay.

jbe

Reply to
Juergen Beisert
