Disclaimer: the things I am writing about are new to me. Although I do my best to have a solid (well, decent in this case) understanding of the covered topics, some inaccuracies might have slipped through. Feel free to point them out in the comments.
Hey guys, or anyone who still happens to visit this blog. First of all, thanks for the Pwnie Awards nomination!
I’ve been recently toying with the idea of learning some electronics that’s not necesarilly related to desktop PCs or the Intel X86(-64) architecture. Finally, the choice fell on programming AVR microcontrollers, or specifically playing with the Arduino Uno board (including an ATmega328 unit) and its many optional shields which can make a project physically functional. As you can see, there’s totally nothing out of ordinary going on, but it’s a start and I’ll probably take the chance to share anything that takes a few hours to learn :-) My first idea for the very initial project was to write an optimized MD5 hash function implementation and create a simplistic bruteforce password cracker. Because of the very limited CPU resources (16MHz) and the fact that avr-gcc 4.3.3 turned out not to be so clever about generating efficient code, I decided to learn some of the AVR architecture, make a first bunch of silly mistakes and write the code in assembly.
As a consequence, I created a very basic md5 library, with the public interface (md5_init, md5_update, …) in C and the computation-heavy part calculating the hash itself (md5_transform) in assembly; it is the subject of this post. As far as I can tell, the code should work on any of ATmega48PA, ATmega88PA, ATmega168PA, ATmega328P MCU and potentially some others. It’s been thoroughly tested both manually and automatically against another third-party implementation, but I do not guarantee that it would always work as expected. Use at your own risk.
Compilation:
$ avr-gcc -Os -DF_CPU=16000000UL -mmcu=atmega328p -c -o md5.o md5.c md5.S
The implementation can be verified using an included md5test.c file, which performs 8 arbitrary hard-coded tests and returns 0 on success, 1 otherwise:
(gdb) file md5test Reading symbols from md5test...(no debugging symbols found)...done. (gdb) target remote localhost:1212 Remote debugging using localhost:1212 warning: Can not parse XML target description; XML support was disabled at compi le time 0x00000000 in __vectors () (gdb) load Loading section .text, size 0x1fbe lma 0x0 Loading section .data, size 0x1b8 lma 0x1fbe Start address 0x0, load size 8566 Transfer rate: 269 KB/sec, 856 bytes/write. (gdb) b exit Breakpoint 1 at 0x1fba (gdb) c Continuing. [New Thread 1] Breakpoint 1, 0x00001fba in exit () (gdb) info reg r24 r24 0x0 0
A full package with source code is available for download here (avr_md5.zip, 9kB). The project is open-source, licensed under MIT license.
For some technical details, see the next section.
Benchmarks
An optimized function implementation would obviously be useless if we were unable to measure how good it really was. Thanks to the 16-bit precision timer available in the MCU (along with two other 8-bit timers), we are able to measure the exact amount of cycles it takes for the code to run. There is an excellent introduction to using AVR timers in C code on the maxembedded.wordpress.com blog – you should definitely check it. The technical details in follow-up posts are not 100% accurate in terms of ATmega328, but it is easy enough to figure out the necessary changes, based on header files and the datasheet. Well, you can also just read another valuable article at “Timers on the ATmega168/328”.
Here’s how my simple benchmark was implemented:
volatile uint32_t clock_count; void timer1_init(void) { // disable interrupts cli(); // zero out the counter TCNT1 = 0; // initialize clock count clock_count = 0; // set normal mode, no prescaling TCCR1A = (0 << WGM11) | (0 << WGM10); TCCR1B = (0 << WGM13) | (0 << WGM12) | (0 << CS12) | (0 << CS11) | (1 << CS10); // enable overflow interrupt TIMSK1 |= (1 << TOIE1); // enable interrupts sei(); } uint32_t timer1_stop(void) { // disable interrupts cli(); // disable timer TCCR1A = 0; TCCR1B = 0; return clock_count + TCNT1; } ISR(TIMER1_OVF_vect) { clock_count += 0x10000; }
Having a reliable execution time measurement method, I decided to confront three different implementations: the one covered in this post (-Os), the example C code presented in md5 RFC 1321 (-O3) and the avr-crypto-lib implementation (-O3). The numbers presented below denote the total number of cycles it took to compute a hash of buffer up to one, four and eight 512-bit blocks:
block# / implementation | my md5 (Os) | rfc1321 md5 (O3) | avr-crypto-lib C md5 (O3) | avr-crypto-lib assembly md5 (O3) |
---|---|---|---|---|
1 | 4756 | 18337 | 23966 | 18684 |
4 | 16527 | 71935 | 89694 | N/A |
8 | 32207 | 143385 | 178896 | N/A |
As shown, my custom implementation is approximately 5x faster than equivalent C code or avr-crypto-lib assembly implementation (both compiled with -O3), and 5.5x faster then avr-crypto-lib C code. The natural question in this situation is – what is the difference between the three pieces of executable code that makes the latter two so much slower? There are two things:
1. gcc-generated code makes extensive use of stack, thus triggering lots and lots of cycle-count heavy instructions:
[...] .text:00000064 lds r6, loc_108 .text:00000066 lds r7, loc_109 .text:00000068 lds r8, loc_10A .text:0000006A lds r9, loc_10B .text:0000006C std Y+0x39, r6 .text:0000006D std Y+0x3A, r7 .text:0000006E std Y+0x3B, r8 .text:0000006F std Y+0x3C, r9 [...]
My implementation, on the other hand, persforms the overall set of transformations by operating solely on registers – memory loads and stores are used only at the beginning and end of the procedure in order to read/write md5 state, and during the computations to read input data. Now, according to the spreadsheet, almost all data transfer instructions take from 2 to 3 cycles, with the exception of mov (Move Between Registers), movw (Copy Register Word), ldi (Load Immediate) and port operations. The first two are used in my assembly almost exclusively, thus saving a lot of execution time.
2. Every transformation out of 64 total for a single 512-bit block includes a “rotate left” operation. Sadly, AVRs don’t have an equivalent of a convenient Intel rol instruction – the only thing available is lsl (shift left by one) with the semantics of “add reg, reg” and rol (rotate left by one) with the semantics of “adc reg, reg”. Let’s see what gcc compiles a C rotation (written as ((x << s) | (x >> (32 – s)))) to:
.text:0000013F ldi r31, 0xC .text:00000140 .text:00000140 loc_140: .text:00000140 lsl r6 .text:00000141 rol r7 .text:00000142 rol r8 .text:00000143 rol r9 .text:00000144 dec r31 .text:00000145 brne loc_140
That’s right. In most cases, gcc simply generates a one-by-one loop using four instructions for the actual rotation (to emulate 32-bit integers), which eventually produces the desired result, but is really slow. Each of the instructions used takes 1 cycle to execute with the exception of brne, which might take 1 or 2 (let’s assume 1). This gives us 186 cycles for worst case scenario of a 31 bit rotation. Obviously, there is room for much optimization here: 16-bit rotations can be implemented with three movw’s (three cycles), 8-bit rotations can be implemented with five mov’s (five cycles), and five-or-more rotations can be implemented by a clever combination of mul’s, mov’s and or’s (twenty cycles). After applying these trivial improvements, we end up with a left rotation that executes in no longer than 28 cycles (check yourself!). On a related note, AVRs with the mul instruction provide amazingly fast 8×8=>16-bit multiplication in only two clocks – something you (or at least me) don’t see very often :)
So… that’s about it. For more details, feel encouraged to investigate the sources by yourself (although there’s probably not so much left). If you have other ideas of what could still be optimized or improved, ping me. More coverage of my new experiences and perhaps some Windows-related material is coming soon. Stay tuned and don’t hesitate to comment. Cheers!
I really really do not understand :)
Hey, what is it that you don’t understand? ;)
because I am still very newbie, in my opinion it is the level of master, I am really still very newbie :)
Congrats to the winner of Pwnie Awards :) Gratualacje! :)
Thanks ;)
man thanks for the explanation and sources. that makes me more anxious about learning low-level languages even tough I only started learning programming in python, c++ and shell scripting about 8 months ago but I hope it doesn’t take me too long to understand the basics
With all loops unrolled, you can really optimize, but it does cost you code size. I have your md5_transform at 6874 bytes (including a call from main). I have size constraints in my project, so I wrote my own that comes in at 740 bytes. Using the simulator in Atmel Studio 7, yours took 4129 clocks for one call; mine took 8075. So yours is roughly twice as fast for 10 times the code.
One further optimization you could make is on rotates of 7 (15, 23) and maybe 6: rotate an extra byte then rotate back one bit (or two). My rotate loops actually use this on 5 bits or more (forward 1 byte, back 3 bits).
I didn’t use your makefile. How do you not have a collision on object file names for md5.c & md5.S?
I did learn from reading your code. I didn’t know you could use a number for a register, and thus calculate a register number in a macro.