That is the main reason why quite a few algorithms use separate input and output buffers (especially for object recognition). Double buffering, with separate acquisition and processing buffers, will speed up the frame processing rate.
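A minimal sketch of that double-buffering idea, assuming a generic camera/vision pipeline (the frame source and processing code are placeholders, not from any specific library): the acquire thread fills one buffer while the processing thread works on the other, and two semaphores hand the buffers back and forth.

```python
import threading

class DoubleBuffer:
    """Two buffers: the camera fills one while processing reads the other."""
    def __init__(self):
        self.buffers = [None, None]
        self.write_idx = 0                    # slot the acquire side fills next
        self.frame_ready = threading.Semaphore(0)  # signals a filled buffer
        self.slot_free = threading.Semaphore(1)    # signals a consumed buffer

    def put(self, frame):
        """Acquire side: store a frame, then swap to the other buffer."""
        self.slot_free.acquire()              # never overwrite an unprocessed frame
        self.buffers[self.write_idx] = frame
        self.write_idx ^= 1                   # camera moves to the other buffer
        self.frame_ready.release()

    def get(self):
        """Processing side: fetch the most recently filled buffer."""
        self.frame_ready.acquire()
        frame = self.buffers[self.write_idx ^ 1]
        self.slot_free.release()              # free that slot for the camera
        return frame

# Toy demonstration: integers stand in for captured frames.
db = DoubleBuffer()
results = []

def producer():
    for i in range(5):
        db.put(i)           # in real code: grab a frame from the sensor

def consumer():
    for _ in range(5):
        results.append(db.get())  # in real code: run the vision algorithm

t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer)
t1.start(); t2.start()
t1.join(); t2.join()
print(results)
```

Because the producer is only ever one frame ahead, acquisition of frame N+1 overlaps with processing of frame N, which is where the throughput gain comes from.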
Actually it is a colour sensor, normally producing YUV data as 16-bit 4:2:2, so as 16-bit data it needs twice the space of 8-bit monochrome, and if converted to RGB, three times the space.
In reality most robotic vision only needs monochrome detail, unless the application actually needs colour information about objects (e.g. print colour checks, fruit picking, or checking that the right colour appears in an area to confirm a bottle/can has its label on correctly).
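To illustrate both points, here is a hedged sketch assuming the common packed YUYV layout of 4:2:2 (pixel pairs stored as Y0 U Y1 V): the monochrome image is just the luma bytes, so extracting it halves the data, and the size ratios against RGB fall out of simple arithmetic.

```python
def yuyv_to_mono(data: bytes) -> bytes:
    """Extract the luma (Y) plane from packed YUYV 4:2:2 data.

    Each pixel pair is stored as Y0 U Y1 V, so the monochrome
    image is simply every second byte of the packed stream.
    """
    return data[0::2]

# Two pixels of YUYV: Y0=10, U=128, Y1=20, V=128
mono = yuyv_to_mono(bytes([10, 128, 20, 128]))
print(list(mono))          # luma values only: [10, 20]

# Storage per 640x480 frame:
w, h = 640, 480
mono_bytes = w * h          # 8-bit monochrome
yuv_bytes  = w * h * 2      # 16-bit YUV 4:2:2 -> twice the space
rgb_bytes  = w * h * 3      # 24-bit RGB       -> three times the space
print(mono_bytes, yuv_bytes, rgb_bytes)
```

So an algorithm that only needs edges or intensity can work on half (or a third) of the data simply by keeping the Y channel.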