DMA w/ Xilinx PCIX core: speed results and question

Params: Xilinx's PCIX core for PCI64/PCIX at 66MHz

  • 2v4000-4 running the controller core with 40 FIFOs (10 targets, 2 channels, r/w) and a busmaster wrapper
  • Tyan 2721 MB w/ Xeon 2.6GHz and 4GB RAM, Win2k Server SP4
  • No scatter/gather support in driver
  • Exact same software and hardware for both reads and writes
  • Bus commands 1110 and 1111

Results:

  • Max host write speed: 70MB/s
  • Max host read speed: 230MB/s
  • Development time: six months w/ two engineers for both driver and core wrapper

The timer does not include the memory allocations. Any ideas why the write speed is so much slower? Would it be the latency parameters in the core? An OS issue?

Reply to
Brannon King

Hi,

When you say "write speed" do you refer to your device becoming bus master and doing memory writes to the system RAM behind the host bridge? Likewise, by the term "read speed" do you refer to your device becoming bus master and doing memory reads of the system RAM behind the host bridge?

I just want to make sure I didn't mis-interpret your question before I try to answer it. Or did I get it backwards?

Eric

Reply to
Eric Crabill

To clarify one issue: "host write" refers to a DMA busmaster read (the busmaster is on my device and is actually reading the data in from the host).

Reply to
Brannon King

Is the bus operating in PCI or PCIX mode? If it's in PCI mode then you are seeing the disadvantage of not being able to post read requests. Your device is getting told to retry while the chipset fetches the read data.

If it's in PCIX mode then you should make sure that your DMA engine is issuing as many posted read requests as possible of as large a size as possible.

Mark

Reply to
Mark Schellhorn

I think Mark described it well in his post. If this is PCI mode, it isn't entirely surprising. If this is PCI-X mode, and you are using split transactions (supporting multiple outstanding requests is best), then you may need to do some hunting.

The best tool for this is a bus analyzer, if you have one (or maybe you can borrow one from a vendor to "evaluate" it?). There could be all manner of secondary issues that cause problems:

  • bus traffic from other agents
  • you are behind a bridge
  • your byte counts are small

Sorry I don't have a more specific answer for you. Eric

Reply to
Eric Crabill

For those speed tests the device was in PCI mode. I was assuming it would be the same speed as PCIX (at the same bus speed) because the timing diagrams all looked compatible between the two. Please explain what you mean by "post read requests". Is there some workaround to make PCI mode handle this better?

Reply to
Brannon King

Have you used a PCI bus analyzer to see the bus traffic?

Is the write data sourced from cache, or is it being fetched from main memory?

Reply to
Andy Peters

Actually I shouldn't have called them "posted reads". Posting a transaction means that the initiator never gets an explicit acknowledgement that the transaction reached its destination (like posting a letter in the mail). PCI writes are posted. A PCI read by definition is non-posted because the initiator must receive an acknowledgement (the read data).

What I should have said was that the PCI-X protocol allows the initiator to pipeline reads. If you have a copy, the PCI-X spec explains it pretty well. Here's the short version:

In PCI-X, the target of a transaction can terminate the transaction with a split response, which tells the initiator that the target will get back to him later with a completion transaction (data if it's a read). The request is tagged with a 5-bit number that will come back with the completion so that the initiator can match completions to outstanding requests. The initiator is allowed to have up to 32 split requests outstanding in the pipeline at any one time. Each read request can be for up to 4kB of data. The throughput of a system that takes full advantage of split transactions is highest when the amount of data being transferred is large and the latency is small enough that 32 tags can keep the pipeline full.
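The tag bookkeeping described above can be sketched as a small table keyed by the 5-bit tag. This is a hypothetical C illustration of the matching logic only (the names and layout are mine, not from the poster's core or driver):

```c
#include <assert.h>
#include <stdint.h>

#define NUM_TAGS 32  /* PCI-X: up to 32 split requests in flight */

typedef struct {
    uint32_t busy;            /* bit i set => tag i has a request in flight */
    uint64_t addr[NUM_TAGS];  /* starting address of each outstanding read */
} tag_table;

/* Allocate a free tag for a new split read request; returns -1 if all
 * 32 tags are in use (the initiator must then stall). */
static int tag_alloc(tag_table *t, uint64_t addr)
{
    for (int i = 0; i < NUM_TAGS; i++) {
        if (!(t->busy & (1u << i))) {
            t->busy |= 1u << i;
            t->addr[i] = addr;
            return i;
        }
    }
    return -1;
}

/* Match an arriving split completion back to its request by tag,
 * freeing the tag for reuse; returns the request's start address. */
static uint64_t tag_complete(tag_table *t, int tag)
{
    assert(t->busy & (1u << tag));
    t->busy &= ~(1u << tag);
    return t->addr[tag];
}
```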

In PCI, the target of a read transaction must either respond with data immediately, or repeatedly terminate the read attempts with retry while he goes off and fetches the data. Once he's fetched it, he will be able to respond immediately to the initiator on the initiator's next attempt. This is very inefficient because there is only one transaction in the pipeline at a time. If the latency is large (the initiator has to retry many times), the throughput is much lower than when pipelined reads are used.
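The contrast between one-at-a-time reads and pipelined reads can be put in a back-of-envelope model. The numbers below are invented for illustration (a 15ns clock for 66MHz, an arbitrary fetch latency), not measurements from the poster's system:

```c
/* Toy model: effective read bandwidth when each request waits out the
 * full fetch latency serially (plain PCI delayed reads, outstanding=1)
 * versus when up to 32 requests overlap (PCI-X split transactions). */
static double read_bw_mbps(unsigned bytes, unsigned latency_clks,
                           unsigned burst_clks, unsigned outstanding,
                           double clk_ns)
{
    /* With 'outstanding' requests in flight, one request's latency
     * hides behind the others' data bursts, up to the point where
     * the bus itself is fully busy. */
    double per_req_ns = (latency_clks + burst_clks) * clk_ns / outstanding;
    double bus_limit_ns = burst_clks * clk_ns;  /* bus can't go faster */
    double ns = per_req_ns > bus_limit_ns ? per_req_ns : bus_limit_ns;
    return bytes / ns * 1000.0;  /* 1 byte/ns = 1000 MB/s */
}
```

With, say, 512-byte requests, a 100-clock fetch latency, and 64 clocks of burst on a 64-bit bus, the pipelined case approaches the bus limit while the serial case stays well below it.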

If PCI-X mode is available, use it. Or, there may be chipset settings that you can use to improve PCI mode performance. The chipset may be able to do pre-fetching of data in anticipation of you reading it. There may also be burst length settings that allow you to increase the amount of data transferred in a single transaction. You need to read the specs for the chipset you are using and figure out what can be tweaked.

Mark

Reply to
Mark Schellhorn

Since it seems like a valuable response, here is Eric's answer:

Hi,

In PCI mode, when you try to "read" the host, most hosts will immediately issue retry. However, they have gleaned some valuable information -- the starting address. That is called a "delayed read request".

Then, the host goes off and prefetches data from that starting address. How much it prefetches is up to the person that designed the host device. Probably 64 bytes or something small like that.

While it is prefetching, if your device retries the read, you'll keep getting retry termination. Time is passing. Eventually, when the host is finished prefetching however much it is going to prefetch, and you return to retry the transaction (for the millionth attempt), it will this time NOT retry you but will give you some data (from one DWORD up to however much it prefetched...). That is called a "delayed read completion".

If that satisfied your device, the "transaction" is over. If you actually wanted more data (the host has no idea how much data you wanted, since there are no attributes in PCI mode), your device will get disconnected. Then, your device will start a new "transaction" with a new starting address, and this horrible process repeats.

It is terribly inefficient (but supposedly better than having the host insert thousands of wait states, which keeps the bus locked up so everyone else is not getting a turn...).

This is replaced by something called split transactions in PCI-X mode, which is more efficient. It is a bit more complicated to explain, though. If you want me to give that a stab, write back and I'll give it a shot tomorrow.

Eric
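The delayed-read loop Eric describes can be counted out in a toy model: retry until the host's prefetch is ready, take one chunk, get disconnected, start over at the next address. All parameters here (64-byte prefetch, a fixed retry count) are invented for illustration:

```c
/* Count total bus attempts needed to move 'total_bytes' via plain-PCI
 * delayed reads, where the host prefetches 'prefetch_bytes' per
 * transaction and the device eats 'retries_per_fetch' retry
 * terminations before each delayed read completion. */
static unsigned delayed_read_attempts(unsigned total_bytes,
                                      unsigned prefetch_bytes,
                                      unsigned retries_per_fetch)
{
    unsigned attempts = 0;
    for (unsigned done = 0; done < total_bytes; done += prefetch_bytes) {
        attempts += retries_per_fetch;  /* retried while host prefetches */
        attempts += 1;                  /* the attempt that finally gets data */
    }
    return attempts;
}
```

Even with these modest numbers, a 4kB transfer in 64-byte prefetch chunks costs hundreds of bus transactions, which is where the write-direction (busmaster-read) bandwidth goes.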

Reply to
Brannon King
