ic3qu33n

ARM Assembly Intro

Learning ARM Assembly through Idiomatic Usage

or

“What do you mean you don’t understand me??! I said exactly what Google Translate told me to say!”

*****************

An Introduction to ARM Assembly

*****************

So, you made a New Year's resolution to learn a new language, it's already March, and you're thinking... "How is it that I can practice on Duolingo every day and still fumble my way through a sentence when somebody asks me 'how are you?' in my target language, and the only thing I can say in response is a collection of pseudorandom vocab words that aren't even in the right tense for the situation?"

Don't worry, dear reader. I, too, have responded "wódka... mleko... jabłko... dobrze" when asked by an acquaintance "Jak się masz?" and I am here to help you avoid making the same mortifying faux pas.

"I thought this was an ARM Assembly tutorial..." It is!

Rather than repeating definitions of specific opcodes, registers, or instructions (for that information, I refer the reader to the appropriately marked resources in the References section below), I am choosing a different didactic technique.

Let’s learn ARM Assembly by approaching it as a study in language learning. As in the case of learning natural languages, knowing the component parts is important, but it is just as important to understand how to put those parts together, how to assemble those modular pieces into a coherent whole.

*****************

References for mastering the ABC's of ARM

*****************

I refer the reader to two exceptional tutorials online:

Both of these tutorials are detailed and comprehensive and are fantastic references that I have consulted while learning ARM Assembly. They also go into more detail than I will here in covering the aforementioned basics.

But what if you’re thinking, “I really don’t want to toggle between tabs just to remember what a branch instruction looks like in ARM. Can’t you just have that on this page?” Fret not, dear reader: I’ve included images that list those basics here. The images in the next section are scans of the notes I took while reading the Azeria Labs tutorials religiously. (I cannot recommend that resource enough; it’s a work of art. Those animations! That VM walkthrough! It is *chef’s kiss* a masterpiece.)

******************

ARM Assembly Idioms and Constructs

******************

If you were learning English, for example, you would first need to know the letters of the alphabet so that you could form words. Once you can recognize letters, you can learn how to arrange them into words that make sense, that is, words whose meaning is defined within the parameters of your language. (The standard is often set by an authority such as Merriam-Webster, which compiles a dictionary of all valid words in the language. The word “valid” is important here: in this context, it refers to whether or not a word satisfies the criteria agreed upon as the standard for the set of available words.)

*****************

A brief sidetrack on the notion of a “valid” word

*****************

For example, the word twah does not have a meaning in English, per Merriam-Webster; it may be defined in one context by a certain group (your friend group might use it as an acronym for “too weird alright, honey?”), which would mean it had a valid use in a specific context, but not to all speakers of that language.

This is like the concept of local and global variables. A local variable is defined in one specific context and has no meaning, or not the same meaning, outside of that context. A global variable, however, is defined for all contexts: it has a standard value, and it is up to any specific user to modify, adapt, or change that value based on their needs or interpretations.
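To make that analogy concrete, here is a minimal C sketch. The names greeting_count, greet, and how_many_greetings are made up for illustration; they are not taken from the repo.

```c
/* Global: defined for every function in this file. */
int greeting_count = 0;

/* Returns the value of its own local variable and bumps the global. */
int greet(void) {
    int twah = 42;          /* local: only has meaning inside greet() */
    greeting_count += 1;    /* global: shared by every caller */
    return twah;
}

/* Cannot see greet()'s twah, but can still read the global. */
int how_many_greetings(void) {
    return greeting_count;
}
```

Referring to twah inside how_many_greetings() would be a compile error, much like using your friend group's acronym with a stranger. greeting_count, on the other hand, means the same thing to every function in the file.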

However, the word “toi” has a meaning in French (“you”), and its pronunciation sounds nearly identical to the Anglicized spelling of that sound, “twa.” Here, we could consider the sound to be the bits of the word, and the letters to be their “encoding,” or their value defined using a set of specific parameters.
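One way to picture the bits-versus-encoding idea in code: the same three bytes can be "spelled" as text or as a number, depending on the rules you read them with. This is only a sketch; the names sound, as_ascii, and as_number are hypothetical, and it assumes an ASCII character set.

```c
#include <string.h>

/* The same three bytes, viewed through two different "spellings". */
static const unsigned char sound[3] = { 0x74, 0x77, 0x61 };

/* Interpret the bytes as ASCII text.
 * Writes a NUL-terminated string into out, which must hold 4 bytes. */
void as_ascii(char out[4]) {
    memcpy(out, sound, 3);
    out[3] = '\0';
}

/* Interpret the same bytes as a single big-endian 24-bit number. */
unsigned long as_number(void) {
    return ((unsigned long)sound[0] << 16)
         | ((unsigned long)sound[1] << 8)
         | (unsigned long)sound[2];
}
```

Read as ASCII, the bytes spell "twa"; read as a number, they are 0x747761. The bits never changed, only the encoding applied to them.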

*****************

Which is a long-winded, tangential way of saying that understanding the component parts, the rules for what is and what is not a valid word (i.e. a valid instruction), is an essential part of understanding any language.

However, once one has that knowledge, the next step is to understand and recognize how to use those component parts to create more interesting things. Despite the high levels of redundancy in language (thanks, Claude Shannon ♥), there are still common configurations, and there is variability in sentences, and sentence structure.

Essentially, you want to be able to say what you want to say.

Language learning is akin to pattern matching — using the rules for a set of specific patterns, one identifies such patterns as they are encountered.

So, let’s recognize some patterns.

I’ll present code constructs in a higher-level language — C — and then present the equivalent representations of that program in ARM Assembly.

I have found that it is easier to gain “fluency” in assembly languages, or at least a more confident proficiency, by following this approach. Of course, this is not the only way to learn ARM Assembly, or any language for that matter, but it has helped me a great deal, so perhaps it will help you too.
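To give a flavor of that pairing before the full walkthroughs, here is a deliberately tiny sketch. The function add is a made-up example, and the assembly in the comment is hand-written for illustration rather than copied from objdump output; what your compiler emits may well differ.

```c
/* Adds two integers. Under the ARM procedure call standard (AAPCS),
 * a arrives in r0, b arrives in r1, and the result is returned in r0. */
int add(int a, int b) {
    return a + b;
}

/*
 * One plausible translation by an optimizing compiler, in ARM (A32) syntax.
 * Illustrative only, not captured tool output:
 *
 *     add r0, r0, r1    @ r0 = a + b
 *     bx  lr            @ return to the caller
 */
```

An unoptimized (-O0) build of the same function is typically longer, because the compiler stores each argument to the stack and loads it back before doing the addition. That verbosity is part of why -O0 output is a friendly place to start reading.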

For anyone who has ever studied a language in a formal classroom setting, this may make more intuitive sense. It is impossible to translate complex sentences from one language to another — and, this is the important part, still maintain the meaning of the original sentence — by translating word by word. If one does this, in Google Translate (pre machine-learning improvements to Google Translate) or in an old-school honest-to-God physical book-form dictionary, the end result won’t have the same meaning as the original in a majority of instances.

Why is that?

Because an individual language may be defined as a stochastic process, but language as a whole (the set of all languages) is not stochastic by the same rules that define the patterns of any one language. Words change meaning in relation to other words. Common structures in one language convey a particular meaning, but those structures may not be present in another language, and attempting to use them in a one-to-one mapping might not express the same meaning there.

The notion of translation is important, because that is essentially the function of a compiler: to serve as a “translator” between a higher-level programming language and the assembly language of a machine. As in natural languages, a translator can change the end result (the assembly language) by applying different rules.
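As a rough illustration of "same meaning, different phrasing," consider the two hypothetical functions below (sum_to and sum_to_closed_form are names invented for this sketch). They express the same idea in C, and a compiler is likewise free to emit any instruction sequence that preserves a function's observable behavior.

```c
/* Sums the integers 1..n with an explicit loop. */
int sum_to(int n) {
    int total = 0;
    for (int i = 1; i <= n; i++) {
        total += i;
    }
    return total;
}

/* Same meaning, phrased differently: the closed form n(n+1)/2. */
int sum_to_closed_form(int n) {
    return n * (n + 1) / 2;
}
```

At -O0, a compiler generally translates the loop quite literally, keeping total and i in memory and updating them on every iteration. At higher optimization levels it may keep them in registers, restructure the loop, or simplify it further. Whether a given compiler actually does any of this is something to check in the disassembly rather than assume.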

*****************

Contents of GitHub Repo

*****************

For these examples, I used programs in C that I had written for technical interview practice exercises on HackerRank, and compiled and disassembled each program on a Raspberry Pi, running on an ARMv7 core. Of course, the output for the assembly (or disassembly) will depend upon different factors, including the architecture of the machine, the assembler or disassembler used, and, in the case of disassembly, the type of disassembly pass implemented (i.e. linear or recursive disassembly).

Machine specs: Raspberry Pi, Model B, running Kali Linux; ARMv7 architecture

I generated an ARM executable from each C source file by compiling with gcc at each of four optimization levels (O0, O1, O2, O3), and then disassembled the resulting executable using objdump. Compilation was performed with a Makefile and a compile script, both of which are included in the GitHub repo for this page. The Makefile and the compile script are essentially equivalent (the compile script is actually more verbose); my explanation for the redundancy is that, honestly, I just like writing bash scripts and this one was fun.

GitHub repo with demo scripts: ARM_assembly_c_idioms

Since the disassembly of a program can also vary depending on the compiler optimizations used at compile time, I have included five files produced as output from assembly and/or disassembly of the same program.

For all programs in this repo, let example.c represent a member of that set, and let the following five files be derived from that input example.c source:

  • example_O0

    the ARM binary executable compiled using gcc with the -O0 flag, from source example.c; gcc {$FLAGS} example.c

  • example_O1

    the ARM binary executable compiled using gcc with -O1 optimization, from source example.c; gcc {$FLAGS} -O1 example.c

  • example_O2

    the ARM binary executable compiled using gcc with -O2 optimization, from source example.c; gcc {$FLAGS} -O2 example.c

  • example_O3

    the ARM binary executable compiled using gcc with -O3 optimization, from source example.c; gcc {$FLAGS} -O3 example.c

  • dis_example_O0.txt

    the disassembled binary compiled with default -O0 optimization; objdump -d example_O0

For the sake of simplicity, the write-ups in the C Idioms section all use the disassembly of the program compiled with the default -O0 optimization. The executables and disassembled ARM assembly for optimization levels O1-O3 are available in the aforementioned GitHub repo. Feel free to peruse them at your leisure.

For consistency, I used objdump to disassemble each of these programs. This also provides another view of what you might encounter if you are reverse engineering a compiled ARM binary.

I chose these programs because they are all relatively small and straightforward, and they implement some common programming constructs as well as well-known algorithms. I think that their generality makes them prime candidates for disassembly. These programs aren’t particularly spectacular or novel; they implement staples of programming, and are thus less overwhelming than jumping right into disassembling firmware.

*************

C Idioms

*************

All pages in the C Idioms section use the following format: present the C source code and a relevant excerpt of the disassembled binary, compiled from the C source file with default -O0 optimization. I have not included the entire disassembly output, for the sake of not overwhelming the reader with an even larger wall of text. On each page, the C source code will be shown in a column to the left and the relevant ARM assembly will be in a column to the right. If you are viewing this on mobile, the columns will be stacked, with the C source above the ARM assembly. The entire disassembly file is available in the GitHub repo linked above.

Each page features as verbose a walkthrough of the disassembly as I judged relevant to understanding the inner workings of the ARM assembly and its relationship to its C source equivalent. In most cases, I go line by line and explain what each ARM instruction is doing and how it relates to a specific C idiom. In some cases, the later walkthroughs specifically, I only highlight the most important instructions as they pertain to the idioms in question. At the end of each walkthrough, I have included a section to ~*apply what you’ve learned*~, where I present the disassembled output for a different but related C program as an exercise for the reader. You can work through the ARM assembly and ~*reverse engineer*~ the original C program. To check whether you have the right idea, toggle the button at the bottom of that section to view the corresponding C source.

If you have questions, feel free to reach out on Twitter or by email (both are linked in the footer of this site).

One-Dimensional Arrays

Bitwise Operators

N Lowest Numbers in a List