Poor PCI performance during read accesses (in master mode)

Hi all !

Thank you for reading this post. I'm having trouble getting good data transfer performance with a PCI core in an FPGA connected directly to a PCI slot in a PC (32-bit / 33 MHz PCI).

The PCI core and the FPGA do not seem to be the cause of the problem.

Most of the time the FPGA acts as bus master, accessing the system SDRAM directly.

Write accesses to SDRAM are very fast, since I can burst as many words as I want (in my case, 48 words), resulting in 130 MB/s of bandwidth.

However, read bursts are limited by the target (the SDRAM controller or just the PCI arbiter, I don't know which) to eight word transfers, resulting in a very poor 50 MB/s of bandwidth. The target always asserts STOP# after the 8th word transfer, resulting in a "disconnect without data transfer".

All my memory accesses (reads and writes) are linearly addressed.

Does anyone have an idea how I can set up my system to get 64-word bursts for read accesses instead of 8-word bursts?

My system is an Intel Pentium III at 600 MHz running under Linux.

Best Regards, Uxello

lspci gives me this:

00:00.0 Host bridge: Intel Corp. 440BX/ZX - 82443BX/ZX Host bridge (rev 03)
00:01.0 PCI bridge: Intel Corp. 440BX/ZX - 82443BX/ZX AGP bridge (rev 03)
00:07.0 ISA bridge: Intel Corp. 82371AB PIIX4 ISA (rev 02)
00:07.1 IDE interface: Intel Corp. 82371AB PIIX4 IDE (rev 01)
00:07.2 USB Controller: Intel Corp. 82371AB PIIX4 USB (rev 01)
00:07.3 Bridge: Intel Corp. 82371AB PIIX4 ACPI (rev 02)
00:11.0 VGA compatible controller: Silicon Integrated Systems [SiS] 86C326 (rev 0b)
00:14.0 Network and computing encryption device: Xilinx, Inc.: Unknown device cafe (rev 01)

I set up the FPGA config regs as follows (lspci -vv):

00:14.0 Network and computing encryption device: Xilinx, Inc.: Unknown device cafe (rev 01)
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
        Status: Cap- 66Mhz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
Reply to
uxello

Writes are always much faster in PCI since most bridges provide FIFOs and allow posted writes.

I wouldn't call 50 MByte/second reads into the PC's SDRAM slow! That's about as fast as I'd expect (based on experience) in a 32/33 machine. What kind of speeds were you expecting and, more importantly, what do you need?

Well, since this is the PC's SDRAM, they may want to limit the bandwidth you get into it.

--
Ron Huizen
BittWare
Reply to
Ron Huizen

Thank you, Ron, for your reply.

Probably, but SDRAM can accept bursts longer than 8 words. I can understand the fairly long latency (around 12 clocks) due to the host bridge, but I don't understand the 8-word limitation.
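
A rough back-of-envelope check with those numbers (assuming one data phase per clock and a clock or so of turnaround per burst):

    8 DWORDs per burst                 = 32 bytes
    ~12 clocks latency + 8 data + ~1   = ~21 clocks = ~630 ns at 33 MHz
    32 bytes / 630 ns                  = ~50 MB/s

so the 8-word disconnect by itself explains the bandwidth I measure.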

Do you think the limitation comes from the chipset or from the SDRAM, and would a more recent PC not have this limitation?

The thing is, to be efficient I have to burst 1024 32-bit words out of SDRAM and 768 words into SDRAM. This results in a bus monopolization of around 121 us for reads, and less than 50 us for writes, which is more acceptable. Moreover, the writes are divided into 'small' 48-word bursts, giving a monopolization of just 3 us per burst.

The total PCI bus load in my application is:

    Acquisition card -> SDRAM : input bitrate    = 100 Mbit/s
    SDRAM -> FPGA card        : transfer bitrate = 125 Mbit/s
    FPGA card -> SDRAM        : transfer bitrate = 100 Mbit/s
    SDRAM -> Ethernet card    : output bitrate   = 100 Mbit/s

As you can see, if my 125 Mbit/s (SDRAM -> FPGA) leg actually costs the equivalent of a 250 Mbit/s transfer because of the poor latency and short bursts on read accesses, the arbiter will have a hard job!
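
Adding it up roughly: 100 + 125 + 100 + 100 = 425 Mbit/s of payload on a bus whose raw ceiling is 32 bits x 33 MHz = ~1056 Mbit/s. If the read leg really costs the equivalent of 250 Mbit/s, I'm already at ~550 Mbit/s, i.e. more than half the theoretical bus, before counting latency, arbitration and the CPU's own traffic.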

You are probably right, but there might be a way to disable this bandwidth limitation, don't you think? Why can I get 1 Gbit/s in one direction and only 400 Mbit/s in the other?

Regards, Uxello

Reply to
uxello

Is it stopping at the end of a cache block? What PCI command are you using? I forget the details, but there is one that tells the host bridge that you want to do a long read, hinting that it should prefetch the next cache block while you read the current one.

Of course, the system is free to ignore that hint and probably will if your read crosses a page boundary in the RAMs.

--
The suespammers.org mail server is located in California.  So are all my
other mailboxes.  Please do not send unsolicited bulk e-mail or unsolicited
commercial e-mail to my suespammers.org address or any of my other addresses.
These are my opinions, not necessarily my employer's.  I hate spam.
Reply to
Hal Murray

This kind of behaviour is probably determined by the chip set, but is quite understandable. You have to think about it from the point of view of the entire system.

A PCI 32/33 bus runs MUCH slower than the PC runs its SDRAM bus; PCI is 132 MB/s, whereas the CPU side is probably accessing the SDRAM at 200+ MHz with 64 or 128 bits per clock (so in the range of 3.2+ GB/s).

One of the "weaknesses" of the PCI protocol is that the desired length of an access is not signalled at the beginning of the burst. During a burst, the target does not know if the burst is going to be 2 DWORDs long, or 128DWORDs or 1024DWORDs. If the entire system were running at 132MB/s, the PC could "slave" the SDRAM to the PCI - continue to fetch words from the SDRAM until the PCI master says "that's enough". That way, you could get uninterrupted PCI read bursts up to the length determined by the master (unless the target needs to stop - say for a page break, or the arbiter interrupts with a higher priority access). However, in a system like the PC, where the SDRAM is running so much faster, this is extremely inefficient; the internal bus would be idle (wasted) for 90%+ during PCI transfers. No PC design would do this...

So instead, the PCI target in the PC chipset picks an "arbitrary" length for a read access when it arrives. The choice of length is determined by a bunch of things: if it is too long (say 128 DWORDs), then there is a good chance that a lot of SDRAM bandwidth is wasted; if the master only wanted 2 DWORDs, the other 126 must be thrown away. The length should also be "big enough" that the SDRAM isn't used inefficiently; SDRAMs are very inefficient at single-word accesses - it takes a lot of clock cycles to open the page and access the column, but once that is done you can burst data very quickly. Then the chipset needs a FIFO between the SDRAM and the PCI target; gates are relatively cheap, but still, you don't want to go overboard.

So, this chip manufacturer chose 8 DWORDs - that's probably a pretty typical choice for this application. It's odd that the chip chooses to do a disconnect without data transfer, rather than a disconnect with the 8th word of data, since this wastes an extra cycle on the PCI bus, but as another poster pointed out, efficiency on the PCI bus is not at the top of the list of things that are important in the design of a PC.

There is probably nothing you can do about this... It is possible that you may be able to change it via something deep in the PC BIOS; it is possible that the FIFO is larger but the BIOS implementer chose to use only 8 DWORDs. Without knowing the chipset registers in detail (something which is probably not available to the public, and even if it is, something that is probably unwise to change), you won't be able to tell. A different chipset could behave differently, but I wouldn't expect any chipset to be significantly better; a different one might do a disconnect on the 8th word (rather than after), or might do 16 DWORDs, or maybe even 32 (although I doubt it), but I would be extremely surprised if any chipset would do more than that, due to the limitations described above...

Avrum

Reply to
Avrum

The PCI bus Memory Read Multiple command can give one cache-line transfer and prefetch the next or can provide two cache-lines of data depending on the bridge (I've seen both). I scanned the Intel 82443 data sheet and didn't see any elaboration on Memory Read Multiple or on cache-line size.

It may be that the 82443 can handle cache-line sizes other than 8 DWORDs. A PCI peripheral configured by the system can have its cache-line size defined by an 8-bit field in the PCI configuration space. The Xilinx PCI core doesn't automatically monitor this register, but it's easy to eavesdrop on the configuration cycle. The bridge might handle larger cache lines.

Check deeper into the Intel bridge documentation to find out whether you can define a cache-line size other than 8 DWORDs. The information may be buried in an app note or a users' guide rather than stated clearly in the data sheet.
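
On the Linux side you can at least check what the BIOS programmed into your card. Here is a quick sketch (my guess at the easiest way on a 2.x kernel, using the /proc/bus/pci interface; 00/14.0 is your Xilinx device from the lspci output above, and 0x0C is the standard Cache Line Size register, counted in DWORDs):

    /* Read the Cache Line Size register (config offset 0x0C) of the
     * device at 00:14.0 through the Linux /proc/bus/pci interface.
     * Adjust the path if the card sits in a different slot.
     */
    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>

    int main(void)
    {
        unsigned char cls;
        int fd = open("/proc/bus/pci/00/14.0", O_RDONLY);

        if (fd < 0) {
            perror("open");
            return 1;
        }
        if (lseek(fd, 0x0c, SEEK_SET) < 0 || read(fd, &cls, 1) != 1) {
            perror("read config space");
            close(fd);
            return 1;
        }
        printf("Cache Line Size = %u DWORDs (%u bytes)\n",
               (unsigned)cls, (unsigned)cls * 4);
        close(fd);
        return 0;
    }

If it reads back 8, that is consistent with the 32-byte cache line of a P6-class machine, and with your bursts stopping after exactly 8 DWORDs. I believe setpci from pciutils can do the same read (something like: setpci -s 00:14.0 0x0c.b).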

Reply to
John_H

Some variants of the 440xx chipset needed to have a bit set in the configuration space of the host bridge to enable PCI read streaming.

You could try it -- I believe it was bit 1 in config reg 0x50. Search around on Google to be sure. I have never tried it myself, and it is not clear to me whether this is 100% safe or whether it may not work reliably all the time. I think this is an undocumented bit.
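
If you want to experiment, something along these lines would read and set that bit from Linux as root. This is only a sketch: 0x50 bit 1 is the rumoured bit, 00:00.0 is the 440BX host bridge from the lspci output earlier in the thread, and poking undocumented host-bridge registers is entirely at your own risk.

    /* Read-modify-write of host bridge config register 0x50, setting bit 1.
     * WARNING: this is the rumoured, possibly undocumented "read streaming"
     * bit on some 440xx chipsets -- experiment at your own risk, as root.
     */
    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>

    int main(void)
    {
        unsigned char reg;
        int fd = open("/proc/bus/pci/00/00.0", O_RDWR);   /* 440BX host bridge */

        if (fd < 0) {
            perror("open");
            return 1;
        }
        if (lseek(fd, 0x50, SEEK_SET) < 0 || read(fd, &reg, 1) != 1) {
            perror("read");
            close(fd);
            return 1;
        }
        printf("reg 0x50 was 0x%02x\n", reg);
        reg |= 0x02;                                      /* set bit 1 */
        if (lseek(fd, 0x50, SEEK_SET) < 0 || write(fd, &reg, 1) != 1) {
            perror("write");
            close(fd);
            return 1;
        }
        close(fd);
        return 0;
    }

Running "setpci -s 00:00.0 0x50.b" first would at least show you the current value before you change anything.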

You also want to make sure you are issuing READM or maybe READL commands to signal to the host bridge that you want a lot of data. Again, I don't know if this will help with the 440BX.
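
For reference, and going from memory of the PCI 2.1 spec (please double-check the encodings before relying on them), the memory commands map onto C/BE[3:0]# during the address phase roughly like this:

    /* PCI bus command encodings on C/BE[3:0]# during the address phase
     * (from memory of the PCI 2.1 spec -- verify against the spec).
     */
    enum pci_bus_command {
        PCI_CMD_MEM_READ             = 0x6,  /* Memory Read                  */
        PCI_CMD_MEM_WRITE            = 0x7,  /* Memory Write                 */
        PCI_CMD_MEM_READ_MULTIPLE    = 0xC,  /* Memory Read Multiple (READM) */
        PCI_CMD_MEM_READ_LINE        = 0xE,  /* Memory Read Line (READL)     */
        PCI_CMD_MEM_WRITE_INVALIDATE = 0xF   /* Memory Write and Invalidate  */
    };

Your core presumably lets the backend choose which of these the initiator drives.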

A lot of chipsets have relatively poor performance for PCI burst reads.

-Ewan

Reply to
Ewan D. Milne

Hurrah!!

Thank you, John, for your advice! I tried issuing a 'Memory Read Multiple' instead of a 'Memory Read', and now my master reads SDRAM using very long bursts (1024 words)!!

The PCI monopolization for reads is now 31 us instead of 120 us.
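
As a sanity check: 1024 DWORDs = 4096 bytes, and 4096 bytes in 31 us is about 132 MB/s, essentially the full 32-bit / 33 MHz wire rate, so reads are now about as fast as my writes.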

I also thank all the other people who replied to my post for their expertise!

Best regards, Uxello


Reply to
uxello
