[Prism54-devel] problems in big endian platform - cannot handle oid 0x00000400

Fri May 13 09:16:08 UTC 2005

Hello Durai, Rainer,

Rainer, sorry to bother you, but i'm trying to relate the data that is
reported here to your previous post about the device bug you found. I
didn't understand the problem at the time, but thanks to this i'm
starting to. Some things still seem unclear though, and I would much
appreciate your input.

> Here it is.

In what follows, i'll try to write down what i think is going on in the
code / device. Please correct me if something is wrong. There are also
many questions to which i have no answer.

> irq enabled<7>wlan0: firmware uploaded done, now triggering reset...
> de_put: deferred delete of loading
> de_put: deferred delete of 00:00.0
> IRQ: Init flag, device initialized
> PIMFOR: op 1, oid 0xff020008, device 0, flags 0x0 length 0x4
> [01][01][ff][02][00][08][00][00][00][00][00][04]
> [01][00][00][00]
> Device is in active state

A mgmt frame has been sent to the device.

> IRQ: Update flag
> Received frame in Management Queue

IRQ received with update flag raised, and mgmt rx queue updated.

> displaying buffer - length 16 [00][00][00][04][00][00][00][0a][00][00][00][0a][00][00][00][0a]

We're entering the rx queue loop for treating incoming mgmt frames, but
we have no time to complete the work on this frame as :

> IRQ: Update flag

second IRQ received, while we're still in the first IRQ handler. I don't
understand here; this situation seems impossible to me; yet this seems
the only explanation for what the log indicates, and sticks well to what
is observed after :

> Received frame in Management Queue

We're treating the same frame as above, as device->index_mgmt_rx has not
yet been updated :

> displaying buffer - length 16 [00][00][00][04][00][00][00][0a][00][00][00][0a][00][00][00][0a]
> PIMFOR: op 0, oid 0x00000400, device 13, flags 0xa length 0xa000000
> [00][00][00][00][04][00][0d][0a][0a][00][00][00]
> [00][00][00][0a][00][00][00][09][00][00]
> wlan0: Out of memory, cannot handle oid 0x00000400

We're trying to cope with the frame. Unfortunately the data seems
completely bogus (maybe it's just me, i don't know what to expect) - it
seems the bug you were refering to, Rainer, which i would formulate as :
the device updates the device pointer of the rx queue in the control
block before the data in the corresponding buffer is ready.

>

>From this point, the nested interrupt has returned, and we're resuming
the previous work. unfortunately, the in-place endianness conversion has
been done (it seems completely useless, as well as the memcpy in another
buffer, but that's another matter. Maybe we should unify mgmt and data
queues by using properly skbs in both).

Please also note that device->index_mgmt_rx has been increased, from
under us.

> PIMFOR: op 0, oid 0x00040000, device 13, flags 0xa length 0xa
> [00][00][00][04][00][00][0d][0a][00][00][00][0a]
> [00][00][00][0a][00][00][00][09][00][00]
> frame: header: 80aef800, data: 80aef80c, size: 22
> Wake up Mgmt Queue

This time, due to the double-endianness conversion, and due to chance,
the data size falls within reasonable bounds.

End of loop, we're increasing device->index_mgmt_rx !again!, with dire
consequences.

> PIMFOR: op 1, oid 0xff020003, device 0, flags 0x0 length 0x4
> [01][01][ff][02][00][03][00][00][00][00][00][04]
> [01][00][00][00]
> Device is in active state
> IRQ: Update flag
> IRQ: Update flag

One interrupt for the tx update, one interrupt for the mgmt response.
But we miss this rx'd frame. That's expected, as the nested interrupt
above lead to a double-increase of device->index_mgmt_rx, which is
out-of-sync with the device.

> Device is in active state
> wlan0: timeout waiting for mgmt response
> PIMFOR: op 1, oid 0x10000000, device 0, flags 0x0 length 0x4
> [01][01][10][00][00][00][00][00][00][00][00][04]
> [01][00][00][00]
> Device is in active state
> IRQ: Update flag
> Received frame in Management Queue
>
> displaying buffer - length 16 [a1][4c][31][e9][33][ec][39][cc][31][ff][59][75][30][36][a2][c4]
> PIMFOR: op 76, oid 0x31e933ec, device 13, flags 0xcc length 0x31ff5975
> [a1][4c][31][e9][33][ec][0d][cc][31][ff][59][75]
> [30][36][a2][c4][00][00][00][04][05][02]

I think that's enough for a start. We're bound to see problems after
this, because thanks to the priv->index_mgmt_rx corruption,
cb->driver_curr_frag[ISL38XX_CB_RX_MGMTQ] for instance is
also double-increased in the refill function. This may be benign
device-wise, but can also be considered a bug by it, and i think it well
may be, as the ASCII in the following packet dumps shows something like
prism#! ... anyways we're not in a sane state.

Let's sum up the problems :

1/ the IRQ handler doesn't seem serialized against itself. I'm
completely at a loss to explain this :(

2/ the device may update the device pointer without the data being
actually present in the buffer (to be verified, that could be a bug in
the code too :( ).

Proposed fix :

0/ quick and dirty fix : spin_lock_irqsave the interrupt handler. This
should get rid of 1/ above. real fixes below.

1/ use a tasklet to do the work in case of an "update interrupt". For
two reasons : the amount of code in this path is huge, and should be
deferred. And tasklets are actually garanteed to be serialized against
one another, which would solve the problem 1/ above.

This should also include the irq handler cleanup as proposed by Rainer.

This would certainly allow us to get rid of index_mgmt_rx which is an
unneeded state variable.

2/ use a streaming mapping for the mgmt queue, with skbs, as in the
data_rx path. This would hopefully avoid problems with some data buffers
being changed twice, and would allow code sharing with the data rx path
(so it will not add any code). Streaming mappings are also preferred, it
seems.

3/ investigate the device bug once this mess is cleared up

This is a larger change, though, so to be done carefully.

Any comments / thoughts ?

Durai, which codebase are you using ?
Luis, can i do this in SVN ?

-- 
Jean-Baptiste Note
+33 (0)6 83 03 42 38
jean-baptiste.note at wanadoo.fr