# Lab 4 - Code Optimization

The deadline has been extended by 24 hours. Lab 4 is now due Friday, October 26 at 11:59pm.

## Evaluation

• Instructions for final handin: For final handin, you will need to submit the final version of your files to the timing server before the deadline. You should ensure that the last files you submit contain your best version (the one you want to be graded on) for each of the three parts.
• Rules for full and partial credit: Points for this lab will be assigned according to the formulas listed below. In the formulas, p stands for points and s stands for the mean speedup or score that you get. Note that the numbers for individual sizes are posted in the table below just so that you can compare your results with these. (Note also that the formulas below have taken into account the variance observed in the CPEs for the same implementation over a few runs.)
• Part I:
• s <= 1.0: p = 0
• 1.0 <= s <= 2.6: p = 15*(s-1.0)/1.6
• 2.6 < s: p = 15
• Part II:
• s <= 1.0: p = 0
• 1.0 <= s <= 2.4: p = 25*(s-1.0)/1.4
• 2.4 <= s <= 2.9: p = 25 + 20*(s-2.4)
• 2.9 < s: p = 35
• Part III:
• s <= 1.0: p = 0
• 1.0 <= s <= 6.2: p = 40*(s-1.0)/5.2
• 6.2 <= s <= 7.3: p = 40 + 10*(s-6.2)/1.1
• 7.3 < s: p = 50
• The numbers in the handout will not be used for evaluation purposes. Those numbers were just so that you get an idea of what's achievable.
• Some performance numbers for Part I are:(numbers refer to the ratio of naive miss score to the optimized one.)  64 128 256 512 1024 Mean 1 3.5 3.5 3.37 2.92 2.61
• Some performance numbers for Part II are:(all numbers refer to speedups)  64 128 256 512 1024 Mean 1.84 4.81 3.14 3.09 3.86 3.19
• Some performance numbers for Part III are:(all numbers refer to speedups)  32 64 128 256 512 Mean 7.76 7.64 7.62 6.97 6.83 7.36
• Group Registration: To use the timing server, and for final handin, you will first need to register your group here. Note that your team login name must be of the form id1+id2, just like the team name you used in Lab L3 to generate your cookie. The order of ids in the team name must be the same as in the struct team in the source files that you submit.
• Timing Server Instructions: After registering your group, you can upload your files to the timing server by clicking here. Please upload your files only after testing and debugging them on a local machine. The form will ask you to upload 3 required files (smooth.c, rotate.c, rotate_cache.c) and 1 optional file (a dump-file to run with driver using the -f option). You may order the files in any way; however, you must name the three required files with the above names -- otherwise, your request will not be served. Each team can submit only one request at a time.

• Oct. 22
• For Parts I and II, for larger image matrix sizes, you will encounter more conflicts among elements of the same matrix, since these are more spaced out in the address space.

For example, consider a 1024 x 1024 matrix. Each row has 1K pixels, each pixel being 6 bytes wide. Thus each row of this matrix is 6 Kbytes wide. Consider the pixels in column i. If src[i][j] is at address A, then src[i+1][j] is at address A+6K, src[i+2][j] is at address A+12K, and so on.

The L1 cache we're targetting is of size 16 KB, and is 4-way associative. This means that addresses A and A+m*4K will map to the same set, where m is an integer.

In particular, src[i][j] at address A, and src[i+2][j] at address A+12K will map to the same set. A set can hold at most 4 cache lines. Therefore, if more than 4 blocks of the matrix map to the same set, the last one accessed is likely to evict the first. If you now want to access the first block again, you will get a miss and pay a performance penalty.

For smaller matrices, this effect is not so pronounced, because the locations within the same matrix are comparatively less far apart. For example, for size 64, if src[i][j] is at address B, then src[i+2][j] is at address B+868.

• For Part II, you will notice that read misses and write misses have different costs. Note that you are reading from src, but writing to dst. So you should decide how to access elements of both arrays so as to achieve the best tradeoff between read and write misses.
• Oct. 21
• For final handin, you will need to submit the final version of your files to the timing server atleast once. Your best scores for Part II and III are recorded/updated each time you submit a job to the timing server, and your grades for those parts will be based on these recorded scores. In addition, you should ensure that the last files you submit contain your best version (the one you want to be graded on) for each of the three parts.
• Oct. 19
• cdriver and miss_score have been updated. The earlier versions allowed some buggy code to pass correctness checks without reporting a warning.
• Here are some hints for tackling this assignment, in the form of techniques you can use:
• Blocking: This is useful to optimize cache performance. You will need to find the right block size to use. Section 6.6.3 of the textbook describes this.
• Eliminating branches: Branches reduce performance, as they make it harder for the processor to speculatively execute instructions. See how you can reduce the number of branches.
• Loop unrolling is a useful technique to reduce branching.
• Reduce expensive arithmetic, especially in the innermost loops.
• Note that the arrays src and dst are contiguous, with dst starting where src ends. Both are aligned on cache line boundaries.
• Oct. 17
• Note that you are free to use different strategies for different image sizes, so long as your code works correctly for all sizes that are a multiple of 32.
• Oct. 16
• If your code runs correctly on a local machine, but your performance numbers vary quite a bit, you should submit it to the timing server. Please follow the instructions above.
• Remember that you're allowed to implement rotate and smooth in any way so long as it is correct. You can choose your own algorithm.
• If your code fails the correctness test, it could be due to two reasons. Either your algorithm is wrong, or your implementation is wrong. To check if your algorithm is correct, try to work it out for a small value of dim first. If your implementation is the problem, you can use gdb or other debugging techniques. To use gdb, you will need to edit the Makefile, comment the line CFLAGS=-Wall -O2 \$(DEFINES) and uncomment the line #CFLAGS=-Wall -O2 -g \$(DEFINES).
• Oct. 15
• You may assume that the src and dst arrays start at addresses that are a multiple of 32 (i.e., they are aligned on cache line boundaries).
• The L1 cache on the Fish machines is a write-allocate cache; the simulator also tracks both read and write misses.
• Oct. 11
• The script miss_score has been updated to print appropriate output if your code crashes. You might want to update this if you copied the files before 4pm.