x86 move multiple

Does the x86 instruction set have a MOVEM/block move/move multiple opcode? What's it called?

I think that, if it does, it would probably invoke PCI Express block packets, reducing per-word latency a lot. Well, if they did it right.

--

John Larkin         Highland Technology, Inc

jlarkin at highlandtechnology dot com
http://www.highlandtechnology.com

Precision electronic instrumentation
Picosecond-resolution Digital Delay and Pulse generators
Custom laser controllers
Photonics and fiberoptic TTL data links
VME thermocouple, LVDT, synchro   acquisition and simulation

Reply to
John Larkin

It won't. If you want PCI to move blocks of data you need to tell the card to transfer a block. There is no way to do that from the CPU side.

--
Failure does not prove something is impossible, failure simply
indicates you are not using the right tools...
nico@nctdevpuntnl (punt=.)
--------------------------------------------------------------
Reply to
Nico Coesel

Yes.

I don't remember. (Clearly, then, the mnemonic would be IDR).

I would be surprised if it did. Designing the system so that the core could reach that far outside of itself to make something happen would be way hard, and an invitation for bugs galore.

--
My liberal friends think I'm a conservative kook.
My conservative friends think I'm a liberal kook.
Why am I not happy that they have found common ground?

Tim Wescott, Communications, Control, Circuits & Software
http://www.wescottdesign.com
Reply to
Tim Wescott

We're addressing a block of multi-ported RAM inside an FPGA, over cabled PCI Express, so it looks like memory-memory moves. I know that we can do single 16- or 32-bit transfers as simple memory read/write opcodes, but the PCIe overhead is around 1 us per transfer. And we can do DMA transfers from the CPU side, with all the driver setup gather-scatter-interrupt hassle. I'm pretty sure that a PowerPC has a MOVEM-style load/store-multiple opcode and could blast blocks of data over PCIe, but maybe Intel can't.

--

John Larkin         Highland Technology, Inc

jlarkin at highlandtechnology dot com
http://www.highlandtechnology.com

Precision electronic instrumentation
Picosecond-resolution Digital Delay and Pulse generators
Custom laser controllers
Photonics and fiberoptic TTL data links
VME thermocouple, LVDT, synchro   acquisition and simulation
Reply to
John Larkin

REP MOVSB or REP MOVSW, IIRC.

About a 50% chance it's faster than a loop. It has its "issues".

Reply to
krw

I think you apply a rep(x) prefix to a movs(y), where x is the terminating condition of the repetition and y indicates how many bytes of the string you move at once.
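
Roughly, yes. In GCC-style inline assembly the combination looks something like this sketch (assuming x86/x86-64 and that the direction flag is clear, as the ABI guarantees):

#include <stddef.h>

/* Minimal sketch: copy 'dwords' 32-bit words from src to dst with REP MOVS.
   "+D", "+S", "+c" bind dst, src, count to EDI/RDI, ESI/RSI, ECX/RCX. */
static inline void rep_movs32(void *dst, const void *src, size_t dwords)
{
    __asm__ volatile ("rep movsl"      /* AT&T mnemonic for the 32-bit string move */
                      : "+D" (dst), "+S" (src), "+c" (dwords)
                      :
                      : "memory");
}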

Isn't that stuff pretty decoupled from the CPU? Even on-chip stuff tends to be decoupled these days.

Reply to
Spehro Pefhany

PCIe is supposed to be fully transparent compared to "legacy" PCI, so that even old drivers aren't supposed to be able to tell the difference. So a memory-memory block move that involved memory-mapped PCI space has to work transparently over PCIe. New CPUs don't even have PCI interfaces... everything is PCIe. You'd think they would reach out to keep PCIe packet overhead from crushing transfer rates by a factor of 50 or so. But realtime stuff like this is *so* hard to get answers on.
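
Rough arithmetic behind that factor, under some assumptions (about 1 us round trip per single non-posted 32-bit read; a Gen1 x1 link signals at 2.5 GT/s, roughly 250 MB/s of payload after 8b/10b coding):

4 bytes per ~1 us read                 ~ 4 MB/s
~250 MB/s link / 4 MB/s achieved       ~ 60x

So packetizing reads one word at a time really does give up most of the link.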

--

John Larkin         Highland Technology, Inc

jlarkin at highlandtechnology dot com
http://www.highlandtechnology.com

Precision electronic instrumentation
Picosecond-resolution Digital Delay and Pulse generators
Custom laser controllers
Photonics and fiberoptic TTL data links
VME thermocouple, LVDT, synchro   acquisition and simulation
Reply to
John Larkin

Indeed it depends on the architecture. I looked into this when developing a PCI card and found out x86 PCs never initiate burst transfers. The PCI device has to become master and push or pull data from the host.

--
Failure does not prove something is impossible, failure simply
indicates you are not using the right tools...
nico@nctdevpuntnl (punt=.)
--------------------------------------------------------------
Reply to
Nico Coesel

rep movsb, rep movsw, rep movsd

Unless you're targeting a 386 or earlier, a tight loop is faster.

All this stuff predates PCI by several years.
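
For what it's worth, the tight loop in question is just an explicit word-at-a-time copy; when the destination is a memory-mapped device window you'd normally make it volatile so the compiler can't coalesce or reorder the accesses. A sketch (plain C, nothing tuned for a particular core):

#include <stddef.h>
#include <stdint.h>

/* Word-at-a-time copy into a memory-mapped window.
   'volatile' stops the compiler from merging or reordering the stores. */
static void copy32(volatile uint32_t *dst, const uint32_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}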

--
100% natural

--- Posted via news://freenews.netfront.net/ - Complaints to news@netfront.net
Reply to
Jasen Betts

Don't you have Intel processor reference manuals in your organisation?

The widest that Intel MOV instructions can do is 128 bits, for XMM register loads. MOVDQA (move double quadword aligned) is the fastest. You might also want MOVNTDQA if it is volatile data being read/written.

You will have to experiment with what loop structure you put around it for maximum performance. It is seriously architecture dependent.

Optimising compilers usually get it about right if told the target CPU, but on a tight loop their heuristics can be wrong.
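
For what it's worth, those instructions are reachable from C through the compiler intrinsics. A minimal sketch, assuming SSE2/SSE4.1 support (compile with -msse4.1), 16-byte-aligned pointers, and that the non-temporal forms are actually appropriate for the memory type of the target window:

#include <emmintrin.h>   /* SSE2: _mm_stream_si128 (MOVNTDQ) */
#include <smmintrin.h>   /* SSE4.1: _mm_stream_load_si128 (MOVNTDQA) */
#include <stddef.h>

/* Copy 'blocks' 16-byte chunks using 128-bit non-temporal moves.
   Both pointers must be 16-byte aligned. */
static void copy128(void *dst, const void *src, size_t blocks)
{
    __m128i       *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;
    for (size_t i = 0; i < blocks; i++) {
        __m128i v = _mm_stream_load_si128((__m128i *)&s[i]); /* MOVNTDQA load  */
        _mm_stream_si128(&d[i], v);                          /* MOVNTDQ store  */
    }
    _mm_sfence();   /* make the non-temporal stores globally visible */
}

As above, whether this beats a plain loop depends heavily on the architecture and the memory type; the only way to know is to measure it on the target platform.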

--
Regards,
Martin Brown
Reply to
Martin Brown

Hell no. We use ARMs.

Yeah, the only way to know a lot of stuff like this is to experiment.

The compiler would also have to know that we're doing a move between local memory and apparently-local memory that is in fact at the far end of the PCIe cable.

I know a guy who wrote a popular book on the PCI bus. I asked him what controlled whether a block of PCI-resident memory was cached or not, and he had no idea.

--

John Larkin                  Highland Technology Inc
www.highlandtechnology.com   jlarkin at highlandtechnology dot com   

Precision electronic instrumentation
Picosecond-resolution Digital Delay and Pulse generators
Custom timing and laser controllers
Photonics and fiberoptic TTL data links
VME  analog, thermocouple, LVDT, synchro, tachometer
Multichannel arbitrary waveform generators
Reply to
John Larkin

They are quite cute.

The "Non-temporal Hint" etc. But alas it is only a "hint" so the execution unit can potentially ignore it. Providing the right hints can prevent volatile stuff clogging up cache lines or it can slow things down. The only way to find out is to try it on the target platform.

--
Regards,
Martin Brown
Reply to
Martin Brown

Or read the PCI spec and don't use the term 'DMA' again. It's a memory-to-memory transfer! DMA died along with the ISA slot.

The compiler can't know that. Usually a piece of physical memory (in this case the memory inside the PCI card) is mapped to a piece of virtual memory your application can access. You can use some OS API calls to translate a virtual address into a physical one, but that won't help you much in forcing a burst transfer. Like I typed before: burst transfers are initiated by PCI devices, not the host!
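
For what it's worth, on Linux one common way to get that mapping into an application is mmap() on the card's sysfs resource file. A sketch (the bus address, BAR number, and 64 KB size below are placeholders for whatever the real card exposes):

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    /* Placeholder device path: substitute the real bus/dev/fn and BAR. */
    int fd = open("/sys/bus/pci/devices/0000:03:00.0/resource0", O_RDWR | O_SYNC);
    if (fd < 0) { perror("open"); return 1; }

    /* Map 64 KB of BAR0 into this process's address space, uncached. */
    volatile uint32_t *bar = mmap(NULL, 65536, PROT_READ | PROT_WRITE,
                                  MAP_SHARED, fd, 0);
    if (bar == MAP_FAILED) { perror("mmap"); return 1; }

    bar[0] = 0x12345678;   /* each access becomes its own PCIe transaction */
    printf("reg0 = 0x%08x\n", (unsigned)bar[0]);

    munmap((void *)bar, 65536);
    close(fd);
    return 0;
}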

The answer is simple: that is controlled by the BIOS and the operating system. In most cases there is a memory area which is cacheable (usually the system's memory) and pieces of memory which are non-cacheable (usually memory-mapped I/O devices). There is not much use in caching a peripheral, although I have come across PCs which cache the I/O devices, so you need to read a register back after writing it for the cache to flush itself into the hardware device. Very annoying....

--
Failure does not prove something is impossible, failure simply
indicates you are not using the right tools...
nico@nctdevpuntnl (punt=.)
--------------------------------------------------------------
Reply to
Nico Coesel

Well, we paid $8K for the Lancero PCIe DMA driver for Linux. It does all the horrible paged linked-list gather-scatter stuff, and they sure call it DMA.

How do you do a DMA transfer into/out of a virtual page that's non-resident? Virtual memory is evil; I wish it had never been invented.

There are x86 opcodes to move bytes, words, longs, and there's one that moves four longs. And there's the REP prefix to do block moves. My question is, how are these implemented as PCIe packets?

That's not simple!

--

John Larkin         Highland Technology, Inc

jlarkin at highlandtechnology dot com
http://www.highlandtechnology.com

Precision electronic instrumentation
Picosecond-resolution Digital Delay and Pulse generators
Custom laser drivers and controllers
Photonics and fiberoptic TTL data links
VME thermocouple, LVDT, synchro   acquisition and simulation
Reply to
John Larkin

DMA is an ancient term which causes a lot of confusion when used in combination with PCI. Traditionally, DMA is a separate controller which seizes the bus and does memory transfers. In PCI land there is no such thing. PCI only knows memory-to-memory transfers, which can be initiated by any PCI device (to be more exact: any PCI bus-master-capable device).

They are not. If you do a block-move instruction you'll see a PCI transaction for each byte/word/quad. The CPU will perform a read and a write operation for each element. Put a logic analyzer on a legacy PCI slot and you can see for yourself :-) It may help to use the parallel port to trigger the logic analyzer (make a pin high on the parallel port just before executing the block move).
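
The parallel-port trick can be as simple as this sketch (Linux/x86 with glibc's sys/io.h; 0x378 is the traditional LPT1 data port, and you need ioperm/root):

#include <sys/io.h>   /* ioperm(), outb() - glibc on x86 Linux */

static void trigger_high(void)
{
    ioperm(0x378, 1, 1);   /* request access to the LPT1 data port */
    outb(0x01, 0x378);     /* drive D0 high: logic-analyzer trigger */
}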

The problem is that the CPU doesn't know anything about the memory you are going to read from / write to. So it has no way of controlling how and where the data appears on a bus. It just reads and writes bytes/words/quads from virtual memory locations.

Before anything else, virtual addresses get translated to physical addresses in order to route the data to/from the right peripheral or memory. At that point the information about which instruction(s) caused the transfer is lost. It's just a series of memory accesses. This is the reason that a PCI device must initiate a burst memory-to-memory transfer.

AFAIK: When transferring bursts of data from a PCI card you'll need to allocate a piece of physical memory which is mapped into the virtual address space of the application. Next the physical address is programmed into the PCI card. After that the PCI card gets a 'go' command and pushes the data into the PCI host. Because the PCI host knows about the physical memory layout it can transfer the data to the proper memory locations. Once the transfer is done, the application can access the data from the mapped area.
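
In driver terms that sequence looks roughly like the sketch below. The register names and offsets are entirely hypothetical (a real card defines its own), and the host buffer has to be DMA-capable physical memory obtained from the OS:

#include <stdint.h>

/* Hypothetical register layout in a bus-mastering card's BAR0. */
#define DMA_ADDR_LO   0x00   /* physical address of host buffer, low 32 bits  */
#define DMA_ADDR_HI   0x04   /* physical address of host buffer, high 32 bits */
#define DMA_LENGTH    0x08   /* transfer length in bytes                      */
#define DMA_CONTROL   0x0C   /* bit 0 = 'go'                                  */

static inline void reg_write(volatile uint8_t *bar, uint32_t off, uint32_t val)
{
    *(volatile uint32_t *)(bar + off) = val;
}

/* Tell the card to push 'len' bytes into the host buffer at bus address 'busaddr'. */
static void start_dma(volatile uint8_t *bar0, uint64_t busaddr, uint32_t len)
{
    reg_write(bar0, DMA_ADDR_LO, (uint32_t)busaddr);
    reg_write(bar0, DMA_ADDR_HI, (uint32_t)(busaddr >> 32));
    reg_write(bar0, DMA_LENGTH,  len);
    reg_write(bar0, DMA_CONTROL, 1);   /* 'go': the card becomes bus master */
}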

--
Failure does not prove something is impossible, failure simply
indicates you are not using the right tools...
nico@nctdevpuntnl (punt=.)
--------------------------------------------------------------
Reply to
Nico Coesel

Back when Intel was just a bad dream, DMA was a common term. It used to mean that a piece of hardware, like a disk controller, either had its own port into CPU memory, or could seize and master the memory bus in single (Unibus type) systems. Intel invented the "DMA controller" which was a separate move engine not necessarily associated with any specific controller device. Intel was always out of the mainstream of computing, and kluged everything they touched.

PCI, yes. But what about PCI Express? Latency is around a microsecond per transfer, about 30x slower than PCI. So, does PCIe run 30x slower, or is the system smart enough to aggregate data into packets?

There's no reason that the CPU (in its microcode or whatever) can't know that the target is PCIe. It's not as if an x86 is simple or anything.

But it's *so* hard to get solid info on low-level stuff like this.

--

John Larkin         Highland Technology, Inc

jlarkin at highlandtechnology dot com
http://www.highlandtechnology.com

Precision electronic instrumentation
Picosecond-resolution Digital Delay and Pulse generators
Custom laser drivers and controllers
Photonics and fiberoptic TTL data links
VME thermocouple, LVDT, synchro   acquisition and simulation
Reply to
John Larkin

Now imagine 16 PCI cards (*) in a system with 1 or more 'DMA' areas. You'd run out of DMA channels very quickly. Not to mention every driver wants the highest priority. The way PCI does it, the playing field is much more level.

(*) No I'm not delirious:

formatting link

That is a good question. Did you already read this paper from Intel:

formatting link

Someone with a similar problem:

formatting link

The MMU is in between. That makes the difference (and the problem).

I got most of my information on PCI by just measuring using a logic analyzer.

--
Failure does not prove something is impossible, failure simply
indicates you are not using the right tools...
nico@nctdevpuntnl (punt=.)
--------------------------------------------------------------
Reply to
Nico Coesel

Note how carefully they avoid revealing actual performance, by not labeling one or the other axis on the graphs.

Yup. Looks like you need DMA to get any decent transfer rate.

The latest compile of our Altera FPGA has a mess of test multiplexers that can bring signals out to test connectors on the front panel of the box. Some of the signals are PCIe activity things. We'll have to write some code and experiment with that.

Thanks for the links.

--

John Larkin         Highland Technology, Inc

jlarkin at highlandtechnology dot com
http://www.highlandtechnology.com

Precision electronic instrumentation
Picosecond-resolution Digital Delay and Pulse generators
Custom laser drivers and controllers
Photonics and fiberoptic TTL data links
VME thermocouple, LVDT, synchro   acquisition and simulation
Reply to
John Larkin

The problem is that it needs to know that it's going to want to read some number of words from a PCI device, and furthermore needs to know that while it's reading them it's not going to get interrupted...

Intel-style block move is the wrong tool. Some sort of integrated PCIe controller peripheral might be better (but if it's tightly bound there's still the interrupt problem). And tacking this onto a CPU makes it start to look like a microcontroller....

--
100% natural

--- Posted via news://freenews.netfront.net/ - Complaints to news@netfront.net
Reply to
Jasen Betts

It's not as if the x86 instruction set has any sort of essential purity or anything. It's packed with complex kluges heaped on one another. A non-DMA block-transfer over PCIe mechanism could have been included.

In the days of ISA bus, one instruction could access a hardware register, and the required interface was a couple of TTL cans. The progression has been to complexity, latency, and uncertainty. PCs are optimized for average throughput, namely block transfers, for both memory and i/o, at the cost of latency and determinacy. Interfacing one LED now requires a PCI Express or USB interface, drivers, insane complexity. For people like us who want to use x86 in realtime applications, it's horrible.

--

John Larkin                  Highland Technology Inc
www.highlandtechnology.com   jlarkin at highlandtechnology dot com   

Precision electronic instrumentation
Picosecond-resolution Digital Delay and Pulse generators
Custom timing and laser controllers
Photonics and fiberoptic TTL data links
VME  analog, thermocouple, LVDT, synchro, tachometer
Multichannel arbitrary waveform generators
Reply to
John Larkin
