We discussed prefetching in the MAPDES meeting the other day. I would love to understand hardware prefetching better; I am, after all, supposed to teach it.

Here's one interesting comment: http://www.tomshardware.com/reviews/Intel-i7-nehalem-cpu,2041-12.html (scroll down a page). Basically: prefetching can help a lot and sometimes hurt a bit, but Intel are not saying much about how they fix it.

One thought: even if hardware prefetching cannot be improved using explicit instructions, it might be possible to do something useful with non-temporal stores? E.g. see http://blogs.fau.de/hager/archives/2103 (a minimal sketch contrasting explicit prefetch and non-temporal stores follows the quoted manual text below).

But here is what Intel *does* say, about Sandy Bridge, although I imagine other processors are similar:

Data Prefetch to L1 Data Cache

Data prefetching is triggered by load operations when the following conditions are met:
• Load is from writeback memory type. [[ie not on direct-mapped I/O pages, and perhaps pinned]]
• The prefetched data is within the same 4K byte page as the load instruction that triggered it.
• No fence is in progress in the pipeline.
• Not many other load misses are in progress.
• There is not a continuous stream of stores.

Two hardware prefetchers load data to the L1 DCache:
• Data cache unit (DCU) prefetcher. This prefetcher, also known as the streaming prefetcher, is triggered by an ascending access to very recently loaded data. The processor assumes that this access is part of a streaming algorithm and automatically fetches the next line.
• Instruction pointer (IP)-based stride prefetcher. This prefetcher keeps track of individual load instructions. If a load instruction is detected to have a regular stride, then a prefetch is sent to the next address, which is the sum of the current address and the stride. This prefetcher can prefetch forward or backward and can detect strides of up to 2K bytes.

Data Prefetch to the L2 and Last Level Cache

The following two hardware prefetchers fetch data from memory to the L2 cache and last level cache:

Spatial Prefetcher: This prefetcher strives to complete every cache line fetched to the L2 cache with the pair line that completes it to a 128-byte aligned chunk.

Streamer: This prefetcher monitors read requests from the L1 cache for ascending and descending sequences of addresses. Monitored read requests include L1 DCache requests initiated by load and store operations and by the hardware prefetchers, and L1 ICache requests for code fetch. When a forward or backward stream of requests is detected, the anticipated cache lines are prefetched. Prefetched cache lines must be in the same 4K page.

The streamer and spatial prefetcher prefetch the data to the last level cache. Typically data is brought also to the L2 unless the L2 cache is heavily loaded with missing demand requests. Enhancements to the streamer include the following features:
• The streamer may issue two prefetch requests on every L2 lookup. The streamer can run up to 20 lines ahead of the load request.
• Adjusts dynamically to the number of outstanding requests per core. If there are not many outstanding requests, the streamer prefetches further ahead. If there are many outstanding requests it prefetches to the LLC only and less far ahead.
• When cache lines are far ahead, it prefetches to the last level cache only and not to the L2. This method avoids replacement of useful cache lines in the L2 cache.
• Detects and maintains up to 32 streams of data accesses. For each 4K byte page, one forward and one backward stream can be maintained.
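To make the non-temporal-store thought above concrete, here is a minimal sketch (mine, not from the Intel text or the linked posts) contrasting a plain copy loop, a copy with explicit software prefetch via _mm_prefetch, and a copy with streaming stores via _mm_stream_pd. The function names and the prefetch distance PF_DIST are illustrative assumptions, and the streaming version assumes dst is 16-byte aligned:

    /* Sketch only: plain copy vs. software prefetch vs. non-temporal
       (streaming) stores.  Compile with e.g. gcc -O2 -msse2 copy.c   */
    #include <stddef.h>
    #include <emmintrin.h>  /* SSE2: _mm_loadu_pd, _mm_stream_pd; pulls in
                               xmmintrin.h for _mm_prefetch/_mm_sfence  */

    /* Plain copy: the hardware streamer should pick this up by itself. */
    void copy_plain(double *restrict dst, const double *restrict src,
                    size_t n)
    {
        for (size_t i = 0; i < n; i++)
            dst[i] = src[i];
    }

    /* Explicit prefetch: hint the line PF_DIST elements ahead into L1.
       The distance is a guess and needs tuning per machine; prefetching
       past the end of the array is harmless (it is only a hint). */
    void copy_prefetch(double *restrict dst, const double *restrict src,
                       size_t n)
    {
        const size_t PF_DIST = 64;  /* assumed distance: 8 cache lines */
        for (size_t i = 0; i < n; i++) {
            _mm_prefetch((const char *)&src[i + PF_DIST], _MM_HINT_T0);
            dst[i] = src[i];
        }
    }

    /* Non-temporal stores: write around the cache so the destination
       does not evict useful lines.  Assumes dst is 16-byte aligned. */
    void copy_stream(double *restrict dst, const double *restrict src,
                     size_t n)
    {
        size_t i;
        for (i = 0; i + 2 <= n; i += 2)
            _mm_stream_pd(&dst[i], _mm_loadu_pd(&src[i]));
        for (; i < n; i++)          /* scalar tail */
            dst[i] = src[i];
        _mm_sfence();               /* order NT stores before later ops */
    }

The streaming version only pays off when the destination is large and not re-read soon, which is exactly Hager's point in the post linked above.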
Furthermore, for Ivy Bridge there is this enhancement (§2.2.7): Hardware prefetch enhancement: A next-page prefetcher (NPP) is added in Intel microarchitecture code name Ivy Bridge. The NPP is triggered by sequential accesses to cache lines approaching the page boundary, either upwards or downwards. (Source: http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32... §2.2.5.4).
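The NPP matters because, per the quoted Sandy Bridge text, the stream prefetchers stop at 4K page boundaries, so a long sequential walk takes demand misses at each page crossing until the stream is re-detected. Here is an illustrative sketch (my construction; the buffer size and timing harness are assumptions) of exactly the access pattern the NPP targets:

    /* Illustrative only: a long sequential read that crosses many 4K
       page boundaries -- the pattern the next-page prefetcher targets.
       Compile with e.g. gcc -O2 walk.c (older glibc may need -lrt).  */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define NPAGES 4096   /* 16 MB: larger than a typical desktop LLC */
    #define PAGE   4096

    int main(void)
    {
        size_t bytes = (size_t)NPAGES * PAGE;
        volatile char *buf = malloc(bytes);
        struct timespec t0, t1;
        long sum = 0;

        if (!buf) return 1;
        for (size_t i = 0; i < bytes; i += PAGE)
            buf[i] = 1;                     /* fault every page in first */

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (size_t i = 0; i < bytes; i += 64)
            sum += buf[i];                  /* one read per cache line */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        printf("sum=%ld  time=%.3f ms\n", sum,
               (t1.tv_sec - t0.tv_sec) * 1e3 +
               (t1.tv_nsec - t0.tv_nsec) * 1e-6);
        free((void *)buf);
        return 0;
    }

Comparing the per-line read time for this walk on Sandy Bridge versus Ivy Bridge should, in principle, expose the page-crossing stalls that the NPP is meant to hide.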