Hi Tuomas
Comments from Lawrence are correct.
Regarding optimizing assembly through COFFEE (a tool integrated with PyOP2 that optimizes the evaluation of local element matrices and vectors), it is indeed true that, in general:
parameters["coffee"]["licm"] = True
parameters["coffee"]["ap"] = True
are the key parameters.
However, if your function spaces have relatively high polynomial order, or if your form has a lot of coefficients, then you may be interested in other optimizations as well. Please let me know if this is the case, and I'll try to be more precise.
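To give a rough idea of what a loop-invariant code motion pass such as "licm" does to a local assembly kernel, here is a toy sketch in plain Python. It is only an illustration: COFFEE works on generated C code, and every name here (phi, w, detJ, coeff, and both functions) is made up for the example and is not part of any Firedrake or COFFEE API.

```python
# Toy illustration only: plain Python standing in for a generated C kernel.
# All names (phi, w, detJ, coeff) are hypothetical and chosen for this sketch.

def local_matrix_naive(phi, w, detJ, coeff):
    """A[i][j] = sum_q w[q]*detJ*coeff[q]*phi[q][i]*phi[q][j], evaluated as written."""
    nq, nb = len(phi), len(phi[0])
    A = [[0.0] * nb for _ in range(nb)]
    for q in range(nq):
        for i in range(nb):
            for j in range(nb):
                # Everything is recomputed in the innermost loop.
                A[i][j] += w[q] * detJ * coeff[q] * phi[q][i] * phi[q][j]
    return A

def local_matrix_hoisted(phi, w, detJ, coeff):
    """Same matrix, with sub-expressions hoisted out of the loops they do not depend on."""
    nq, nb = len(phi), len(phi[0])
    A = [[0.0] * nb for _ in range(nb)]
    for q in range(nq):
        s = w[q] * detJ * coeff[q]      # invariant in both i and j
        for i in range(nb):
            t = s * phi[q][i]           # invariant in j
            for j in range(nb):
                A[i][j] += t * phi[q][j]
    return A
```

The innermost loop goes from four multiplications per iteration to one. The more coefficients a form has, the more such invariant sub-expressions there are to hoist.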
That said, our (short-term) goal is to spare you (users) from having to play with this sort of "low-level parameter" at all. Just to let you know, we are currently working on an autotuning system that, once in place, should allow you to get significantly better run-times without having to set and try individual optimizations manually. We are not far from that, so we'll keep you posted (it should take on the order of days).
As for the MIC, we have some performance numbers for assembly "in isolation", but we (I) have never ported the whole toolchain to the MIC; that is, we have only used it as an accelerator. In particular, the code is *not* specifically optimized for this kind of architecture, especially if you are running at low polynomial order (in practice, we are not taking full advantage of the wide vector lanes).
As for MKL in assembly kernels, I'm currently running experiments to quantify how much faster we can go by transforming assembly code into a sequence of BLAS calls. The problem is that at low polynomial order the matrices involved are rather small. What I have seen so far (but these are early experiments) is that if you are using, say, polynomial order 3 or 4 and your form has some coefficients in it (I'm being vague here, just to give you an idea), then turning to BLAS may be a (big) win. I hope to be able to report more (convincing) results, and to be more precise, in the next few days.
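The idea of recasting a local assembly loop nest as a matrix-matrix product can be sketched as follows. This is a toy in plain Python, not the code used in the experiments: the hand-written matmul merely stands in for a BLAS GEMM call, and all names (phi, scale, and the functions) are hypothetical.

```python
# Toy illustration only; matmul is a stand-in for a BLAS GEMM call.
# All names (phi, scale) are hypothetical and chosen for this sketch.

def matmul(X, Y):
    """Return the product X*Y for matrices stored as lists of rows."""
    inner, cols = len(Y), len(Y[0])
    return [[sum(X[i][k] * Y[k][j] for k in range(inner)) for j in range(cols)]
            for i in range(len(X))]

def local_matrix_loops(phi, scale):
    """A[i][j] = sum_q scale[q]*phi[q][i]*phi[q][j], written as a loop nest."""
    nq, nb = len(phi), len(phi[0])
    A = [[0.0] * nb for _ in range(nb)]
    for q in range(nq):
        for i in range(nb):
            for j in range(nb):
                A[i][j] += scale[q] * phi[q][i] * phi[q][j]
    return A

def local_matrix_gemm(phi, scale):
    """Same matrix, computed as (diag(scale)*phi)^T * phi, i.e. one GEMM-shaped product."""
    nq, nb = len(phi), len(phi[0])
    # Build the (nb x nq) matrix whose entry (i, q) is scale[q]*phi[q][i].
    scaled_T = [[scale[q] * phi[q][i] for q in range(nq)] for i in range(nb)]
    return matmul(scaled_T, phi)
```

The product has shape (nb x nq) times (nq x nb), where nb is the number of basis functions and nq the number of quadrature points. At low polynomial order both are small, which is why the call overhead can outweigh the gain.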