XILINX PCIe read of slow device

What is the correct way to handle a PCIE request to a slow device?

I have a Xilinx Spartan-6 PCIe design using the Integrated Block for PCI Express.

The BAR memory map is decoded and some addresses map to fast RAM or local registers, and these work OK, but some addresses map to slow devices, like I2C or internal processes that need a few cycles to process before they can produce valid data to be returned to the PCI bus.

Is there a way to tell the PCI bus to wait, or retry..?

thanks

Reply to
David Binette

David,

What specific problem are you trying to address?

The Completion Timeout Mechanism of the PCIE spec is optional, and must be enabled by SW during device configuration.

Can you just disable this? You can force it disabled on either end (root complex, or endpoint). I don't think it's enabled by default, but I can't check at the moment...
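If it is the completion timeout biting, the knob for it lives in the Device Control 2 register of the PCI Express capability. As a rough sketch only (the device address 0000:01:00.0 is a placeholder, it needs root, and in practice setpci from the shell does the same job), checking and setting the Completion Timeout Disable bit from Linux user space could look like:

/*
 * Sketch only: check and, if supported, set the Completion Timeout Disable
 * bit (Device Control 2, bit 4, per the PCIe spec) of a PCIe function from
 * Linux user space by editing its config space through sysfs.
 * The BDF 0000:01:00.0 is a placeholder; run as root. Assumes a
 * little-endian host, since config space is little-endian.
 */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

static uint32_t cfg_read(int fd, int off, int width)
{
    uint32_t v = 0;
    if (pread(fd, &v, width, off) != width)
        perror("pread");
    return v;
}

int main(void)
{
    int fd = open("/sys/bus/pci/devices/0000:01:00.0/config", O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    /* Walk the standard capability list for the PCI Express capability (ID 0x10). */
    uint8_t pos = cfg_read(fd, 0x34, 1) & 0xFC;          /* Capabilities Pointer */
    uint8_t exp = 0;
    for (int i = 0; pos && i < 48; i++) {
        if (cfg_read(fd, pos, 1) == 0x10) { exp = pos; break; }
        pos = cfg_read(fd, pos + 1, 1) & 0xFC;
    }
    if (!exp) { fprintf(stderr, "no PCIe capability found\n"); return 1; }

    /* Device Capabilities 2 (cap + 0x24), bit 4: CTO Disable supported? */
    if (!(cfg_read(fd, exp + 0x24, 4) & 0x10)) {
        fprintf(stderr, "Completion Timeout Disable not supported\n");
        return 1;
    }

    /* Device Control 2 (cap + 0x28), bit 4: Completion Timeout Disable. */
    uint16_t devctl2 = (uint16_t)cfg_read(fd, exp + 0x28, 2);
    devctl2 |= 0x10;
    if (pwrite(fd, &devctl2, 2, exp + 0x28) != 2)
        perror("pwrite");

    printf("DevCtl2 now 0x%04x\n", (unsigned)cfg_read(fd, exp + 0x28, 2));
    close(fd);
    return 0;
}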

Or are you asking something else?

Regards,

Mark

Reply to
Mark Curry

Mark, thanks, I will look into the 'completion timeout mechanism' to see if it is the answer to my need. Am I asking something else? I don't know, it is all kind of new to me.

Part of the difficulty is that the PCIe system and the local application are on different clock domains, so when the PCIe read occurs I deal with the clock crossing, but it takes clock cycles before I can return something to the PCIe read request.

Reply to
David Binette

On Monday, October 27, 2014 at 19:05:32 UTC+1, David Binette wrote:

For peripherals that are slow, like I2C on a normal MCU, you would normally have a register to initiate the read and a status register you can poll to see when the result is ready.
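To make that concrete, here is a rough host-side sketch of the trigger-then-poll pattern over a memory-mapped BAR. All register offsets, bit assignments, and the sysfs path are invented placeholders for whatever your own register map would define:

/*
 * Host-side sketch of the "trigger, then poll a status register" pattern
 * over a memory-mapped BAR. The register offsets, bit meanings and the
 * sysfs resource path are hypothetical placeholders.
 */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define I2C_CMD     (0x00 / 4)   /* write: device/register index, starts the I2C read */
#define I2C_STATUS  (0x04 / 4)   /* read:  bit 0 = result ready                        */
#define I2C_DATA    (0x08 / 4)   /* read:  byte(s) fetched from the I2C device         */

int main(void)
{
    int fd = open("/sys/bus/pci/devices/0000:01:00.0/resource0", O_RDWR | O_SYNC);
    if (fd < 0) { perror("open"); return 1; }

    volatile uint32_t *bar = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                                  MAP_SHARED, fd, 0);
    if (bar == MAP_FAILED) { perror("mmap"); return 1; }

    /* 1. Kick off the slow access; this PCIe write completes immediately. */
    bar[I2C_CMD] = 0x50;                      /* e.g. "read I2C register 0x50" */

    /* 2. Poll the status register; each poll is itself a fast BAR read,
     *    so no PCIe completion ever has to wait on the I2C bus.           */
    while ((bar[I2C_STATUS] & 0x1) == 0)
        usleep(10);

    /* 3. Fetch the result, now sitting in a fast register.                */
    printf("I2C result: 0x%08x\n", (unsigned)bar[I2C_DATA]);

    munmap((void *)bar, 4096);
    close(fd);
    return 0;
}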

-Lasse

Reply to
langwadt

Yes, that is a good solution, but for a different problem. In this case, the data is always 'ready'; it is continuously changing on a faster clock domain, and I need a couple of cycles for the read request to cross domains.

I've tried unsuccessfully to manipulate the IP core's 'trn_tsrc_rdy_n' line, looking at the read address before setting the start-of-frame line in an effort to pre-fetch the data, but for some reason the core will not tolerate any delays.

Reply to
David Binette

can't you just keep a copy of the data on the other clock domain?

-Lasse

Reply to
langwadt

Yes, that is feasible for a small number of items, and it may be 'plan B' if no PCI bus solution is available to me.

I like your suggestions; they are all reasonable, and I'll take the best alternative I can get if I don't find a way to do this via PCIe.

Reply to
David Binette

This is out of UG654, page 133, for a simple PIO access. I'm not sure what your host driver might be using.

"While the read is being processed, the PIO design RX state machine deasserts trn_rdst_rdy_n, causing the Receive TRN interface to stall receiving any further TLPs until the internal Memory Read controller completes the read access from the block RAM and generates the completion. Deasserting trn_rst_rdy_n in this way is not required for all designs using the core. The PIO design uses this method to simplify the control logic of the RX state machine."

Also take a look at page 141

--
Chisolm 
Republic of Texas
Reply to
Joe Chisolm

I'm still not sure what exactly your requirement is. In one post you write that you want to read from slow devices (like I2C). That would mean the problem is this:

- you issue a PCIe read request

- this read request triggers something, e.g. a read from an I2C device, which takes a certain time

- meanwhile, you cannot respond to the PCIe read request in time because you haven't received the result yet

In that case, do what Lasse suggests: Have one register to trigger the read and another one that can be polled via PCIe indicating when the data is ready.

But in another post you write "the data is always 'ready' it is continuously changing, on a faster clock domain", which is something entirely different. Is it streaming data? Do you need to catch all the data or do you want to read out only one single value occasionally? Is it dependent on your read, meaning that your read request initiates a calculation or something that you want the result of, or is the data totally independent and you only occasionally want to read the current value?

Since I don't understand what you really want to do, here's a few other possibilities:

- You could just always transfer the data you have to the PCIe clock domain whenever it changes. Each time there is a new value, always transfer it to the PCIe clock domain immediately and put it e.g. into a BAR register. So when you issue a PCIe read request, there's data already there that you can put into your reply message immediately. Worst case is you don't get the very latest value but the one before that.

- If you need to catch all the values, I'd put the data into a FIFO. You could then e.g. issue an MSI (Message Signaled Interrupt) when the FIFO is e.g. half-full (or keep polling prog_full or something) and then read it out in a burst from the PCIe side (see the sketch after this list). No need for clock-domain crossing for the read request, as you only read from the FIFO that has its read port in the PCIe clock domain. No need for PCIe to wait for data too long, since data from the FIFO is available one or two clock cycles after the read request was issued (depending on how you configure the FIFO).

- If in your design the read request itself triggers something that takes a while, do what Lasse suggests.
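As a rough host-side illustration of the FIFO option, something like the sketch below; the FIFO_LEVEL/FIFO_DATA offsets are invented, and in a real driver the drain would normally happen in the MSI handler rather than a user-space poll:

/*
 * Host-side sketch of the FIFO approach: the FPGA pushes samples into a
 * FIFO whose read side lives in the PCIe clock domain; the host checks a
 * fill-level register and drains the FIFO in a burst. FIFO_LEVEL and
 * FIFO_DATA are invented offsets.
 */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define FIFO_LEVEL  (0x10 / 4)   /* read: number of words currently stored */
#define FIFO_DATA   (0x14 / 4)   /* read: pops the next word               */

static size_t drain_fifo(volatile uint32_t *bar, uint32_t *buf, size_t max)
{
    size_t n = bar[FIFO_LEVEL];              /* how much is waiting */
    if (n > max)
        n = max;
    for (size_t i = 0; i < n; i++)
        buf[i] = bar[FIFO_DATA];             /* read port is already in the PCIe
                                                clock domain, so each read completes
                                                in a couple of cycles */
    return n;
}

int main(void)
{
    int fd = open("/sys/bus/pci/devices/0000:01:00.0/resource0", O_RDWR | O_SYNC);
    if (fd < 0) { perror("open"); return 1; }
    volatile uint32_t *bar = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                                  MAP_SHARED, fd, 0);
    if (bar == MAP_FAILED) { perror("mmap"); return 1; }

    uint32_t buf[256];
    size_t got = drain_fifo(bar, buf, 256);  /* in a driver, call this from the
                                                MSI handler instead of polling */
    printf("drained %zu words\n", got);

    munmap((void *)bar, 4096);
    close(fd);
    return 0;
}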

HTH, Sean

Reply to
Sean Durkin

I understand this "deasserts trn_rdst_rdy_n, causing the Receive TRN interface to stall receiving any further TLPs"

but I'm not so much interested in 'any further TLPs' as in allowing the current TLP to continue processing; it seems that if I delay even a single extra cycle it causes distress to the Linux host.

Reply to
David Binette

Hi Sean, thanks for the suggestions, but I think what I really need is a way to stall the current TLP to allow the read/access to complete.

-- Is it streaming data? Do you need to catch all the data or do you
-- want to read out only one single value occasionally?

The data is always changing, and only needs to be read occasionally.

-- You could just always transfer the data you have to the PCIe clock
-- domain whenever it changes. Each time there is a new value, always
-- transfer it to the PCIe clock domain immediately and put it e.g. into a
-- BAR register. So when you issue a PCIe read request, there's data
-- already there that you can put into your reply message immediately.
-- Worst case is you don't get the very latest value but the one before that.

That would be OK for most cases, but some reads have side effects, such as clearing another register upon read. This could be overcome and is not a show stopper; that part could be redesigned. Also, since the external device has a lot of registers and they are typically accessed by setting their address and reading the result (sometimes a calculated result), it would require significant changes to create a bank of shadow values to capture them all for instantaneous retrieval instead of indexed on-demand access.

How do other people handle things like doing SMBus reads over PCIe, or an I2C device? The first read is certainly going to need some time to complete before it can return data.

Perhaps I just fumbled something during my tests and subsequently discarded what should have been a viable approach.

If I knew exactly how it should be done I could focus my efforts on that.

Reply to
David Binette

PS: I know that SMBus is an independent bus on the PCIe connector; I don't mean to complicate the topic with that. It was only an example to illustrate.

Reply to
David Binette

It's generally best to avoid side-effects if at all possible and make all reads idempotent. Life is much easier for software that way.

For example, TLPs may be re-ordered, accesses above a certain size may not occur in the order you expect, the root complex may attempt to pre-fetch a value, in future you may be using this device over a lossy medium like Ethernet.

All of these things can be controlled (or worked around) in software but often lead to inefficiencies. If you have the choice, it's always better to design your interface with a view to simplifying the software interaction. This generally also yields simpler hardware and fewer gotchas in the documentation so everyone's a winner!
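One concrete way to get rid of a clear-on-read register, for example, is a write-1-to-clear (W1C) status register. The sketch below is only an illustration of that pattern; the W1C scheme, offsets and path are mine, not from the design under discussion:

/*
 * Illustration of keeping reads idempotent with a write-1-to-clear (W1C)
 * status register instead of clear-on-read: a read (or a prefetch, or a
 * repeated/re-ordered read) never changes device state, and software
 * acknowledges only the bits it actually handled. Offsets, bit meanings
 * and the path are hypothetical.
 */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define EVT_STATUS  (0x30 / 4)   /* read:  pending event bits, no side effect   */
#define EVT_CLEAR   (0x30 / 4)   /* write: writing 1 to a bit clears that event */

int main(void)
{
    int fd = open("/sys/bus/pci/devices/0000:01:00.0/resource0", O_RDWR | O_SYNC);
    if (fd < 0) { perror("open"); return 1; }
    volatile uint32_t *bar = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                                  MAP_SHARED, fd, 0);
    if (bar == MAP_FAILED) { perror("mmap"); return 1; }

    uint32_t pending = bar[EVT_STATUS];      /* safe to read any number of times */
    if (pending) {
        /* ... handle each pending event here ... */
        bar[EVT_CLEAR] = pending;            /* acknowledge only what we saw     */
        printf("handled events: 0x%08x\n", (unsigned)pending);
    }

    munmap((void *)bar, 4096);
    close(fd);
    return 0;
}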

Thanks,

Chris

Reply to
Chris Higgs

David,

I can't offer any specific advice - but generally all PCIE transactions are "stalled", whether they're reading from a slow device on another clock or a "fast" device on the same clock.

For a PIO read you get:

1. The host issues a PIO read.
2. A TLP MRd packet is formed and sent across the serial interface.
3. The Xilinx endpoint decodes the packet, determines that the packet is meant for the user logic - you. It sends the information out to the user interface logic.
4. Your logic issues the read, and responds.
5. The CplD packet is formatted and transmitted back across the PCIE link.
...

All of that takes quite a bit of time. The fact that step 4 takes a few cycles (give or take 10s or perhaps even 100s) is almost irrelevant. The PCIE timeout mechanism doesn't come into play until this number is very high (I've not used it, but I'd think we're talking 10s of ms).

The whole process has quite a bit of latency. A few cycles here or there aren't going to matter.
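To put a rough number on that latency, a loop like the one below times single 32-bit BAR reads from the host; on typical systems the round trip is on the order of a microsecond, which dwarfs a few extra user-clock cycles in the endpoint. Paths and offsets are placeholders, and it assumes something readable sits at offset 0 of BAR0:

/*
 * Rough latency check: time single 32-bit reads from a BAR-mapped register
 * to see the full host -> endpoint -> host round trip. Paths and offsets
 * are placeholders; it assumes something readable sits at offset 0 of BAR0.
 */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/sys/bus/pci/devices/0000:01:00.0/resource0", O_RDWR | O_SYNC);
    if (fd < 0) { perror("open"); return 1; }
    volatile uint32_t *bar = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                                  MAP_SHARED, fd, 0);
    if (bar == MAP_FAILED) { perror("mmap"); return 1; }

    enum { LOOPS = 100000 };
    struct timespec t0, t1;
    uint32_t sink = 0;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < LOOPS; i++)
        sink ^= bar[0];                      /* each read is a non-posted MRd */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("avg read latency: %.0f ns (sink=0x%x)\n", ns / LOOPS, (unsigned)sink);

    munmap((void *)bar, 4096);
    close(fd);
    return 0;
}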

I don't use that specific PCIE core, nor Xilinx logic (I'm using the Virtex-7 core, with AXIS interfaces tied to my logic). But the general flow should be the same. I'd review the interface specification to fully understand what's required. Are you running sims with the Xilinx logic?

Regards,

Mark

Reply to
Mark Curry

Thanks Mark for your time and comments, which were helpful.

I haven't put it on the simulator, just doing compiles and tests, but the turn time is long.

Reply to
David Binette

Does Xilinx provide a realistic Root Complex model or some other type of PCIe verification environment?

Rolling your own can be some amount of work. However, it might be possible to instantiate a Xilinx Root Complex in your testbench and use that to stimulate your DUT.

//Petter

--
.sig removed by request.
Reply to
Petter Gustad


Yes, the example design provided with the PCIe EP Block contains a root port model.

I've recently worked on a Spartan-6 design similar to the OP's, in which the FPGA is a bridge between the processor over PCIe and a local bus with several peripherals. I started with the example design and modified the PIO Rx and Tx engines to work for my application. Most of the local bus cycles are fast enough that software is not having to wait. A timeout was implemented on the local bus cycles that issues an MSI interrupt on the PCIe link if the peripheral doesn't respond within the timeout period (~1 us). One issue we ran into WRT PCIe packet timing is that the MSI interrupt was not being seen by software before the next transaction was issued on the link. We ended up using a status register for software to poll instead.
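For what it's worth, the software side of that poll-a-status-register approach can look something like the sketch below; the register offsets and bit meanings are hypothetical stand-ins for whatever the bridge actually implements:

/*
 * Sketch of the poll-a-status-register approach for bridged local bus
 * cycles: software starts the cycle, then polls until the bridge reports
 * either completion or its internal (~1 us) timeout. All offsets and bit
 * positions are hypothetical.
 */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define LB_ADDR    (0x28 / 4)   /* write: local-bus address, starts the cycle */
#define LB_STATUS  (0x20 / 4)   /* read:  bit 0 = done, bit 1 = timed out     */
#define LB_DATA    (0x24 / 4)   /* read:  data captured from the local bus    */

static int local_bus_read(volatile uint32_t *bar, uint32_t addr, uint32_t *out)
{
    bar[LB_ADDR] = addr;                     /* launch the local-bus cycle */
    for (;;) {                               /* the bridge's own timeout
                                                guarantees this resolves   */
        uint32_t st = bar[LB_STATUS];
        if (st & 0x2)
            return -1;                       /* peripheral didn't respond  */
        if (st & 0x1) {
            *out = bar[LB_DATA];
            return 0;
        }
    }
}

int main(void)
{
    int fd = open("/sys/bus/pci/devices/0000:01:00.0/resource0", O_RDWR | O_SYNC);
    if (fd < 0) { perror("open"); return 1; }
    volatile uint32_t *bar = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                                  MAP_SHARED, fd, 0);
    if (bar == MAP_FAILED) { perror("mmap"); return 1; }

    uint32_t val;
    if (local_bus_read(bar, 0x100, &val) == 0)
        printf("local bus [0x100] = 0x%08x\n", (unsigned)val);
    else
        fprintf(stderr, "local bus cycle timed out\n");

    munmap((void *)bar, 4096);
    close(fd);
    return 0;
}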

Reply to
kkoorndyk

On Monday, October 27, 2014 at 11:05:32 AM UTC-7, snipped-for-privacy@gmail.com wrote:

-- some addresses map to fast RAM or local registers and these work OK,
-- but some addresses map to slow devices, like I2C or internal processes
-- that need a few cycles to process before they can produce valid data to
-- be returned to the PCI bus.

Just in case this is what you are trying to do: stalling your whole system and all other PCIe accesses to wait for an I2C read should never be the solution to anything. You send your completion whenever it's ready. If it takes you longer than the spec to complete, then you need to initiate the read in some other way (earlier), check for ready, and only then issue the read you can complete on time.

Reply to
Luis Benites

Please look at the date of the post you are replying to. Do you think someone will have been waiting over six years for an answer to a Usenet post? It's nice that you are trying to help, of course.

Reply to
David Brown

Ha ha. Let's start a flame war over trying to help. Don't you have better use of your time? Anyone looking for CORRECT PCIe help with a Google search will come across this post and get something from it. Nothing that was said is outdated.

Reply to
Luis Benites
