This is BrainLog, a blog by Dan Sanderson. Older entries, from October 1999 through September 2010, are preserved for posterity, but are no longer maintained.

July 31, 2007

Indenting Source Code in Emacs

Most Emacs programming modes include automatic indentation facilities. If you press the tab key on any line, Emacs pushes the line into an appropriate place according to the structure of the code around it.

In most cases, the amount of indentation to use for any given line of computer source code is purely a matter of style that only human beings care about. Most programming languages use other grammatical elements, like curly brackets { and }, to represent code structure. The layout is just to make it easier for humans to follow:

for (int i=0; i < lst.length; i++) {
    initializeFlubber(lst[i]);
    if (isFlubberReady(lst[i])) {
        print("READY #%d: %s\n", i, lst[i].desc);
    }
}
destroyAllFlubbers();

Since indentation style is arbitrary, different programming communities tend to use different styles. The default indentation behaviors of Emacs modes are acceptable to most communities, both because the defaults were based on existing popular styles and because the defaults influence popular style in turn: it's easier to use the default than to change it to something else.

Of course, if your community uses a different style, you can customize Emacs to do it your way.

Using Space Characters For Indentation

Text files contain a series of codes that each represent a letter, number, or punctuation mark in the order they appear in the document represented by a file. The possible codes also include a code for a single space, such as what you get when you hit the space bar, and a code for a "tab."

Just as typewriters used mechanical "tab stops" for lining up columns and indenting paragraphs, computer terminals respond to tab codes in text files by moving over to the next column position. Terminals place tab stops at regular intervals, typically every 4 or 8 character widths, though the size of the interval can vary and is sometimes adjustable.

A file containing these characters:

First<TAB>Second
12<TAB>532
998<TAB>7
<TAB>END

would look like this with a tab width of 8 characters:

First   Second
12      532
998     7
        END

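To set the tab stop interval in Emacs for the current buffer, affecting how tab codes in the file are displayed (use setq-default instead, as described below, to change the default for every buffer):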

(setq tab-width 4)
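
As a rough check of the example above, you can ask Emacs to expand the tabs for you. This sketch rebuilds the four-line sample in a temporary buffer and converts its tabs to spaces with untabify, which honors the buffer's tab-width. Evaluate it in the *scratch* buffer with C-x C-e; the string is just the sample from above, not anything special:

```elisp
;; Expand the sample file's tabs using an 8-column tab stop.
(with-temp-buffer
  (setq tab-width 8)                  ; applies only to this temporary buffer
  (insert "First\tSecond\n12\t532\n998\t7\n\tEND\n")
  (untabify (point-min) (point-max))  ; replace each tab with spaces
  (buffer-string))                    ; return the expanded text
```

Changing the 8 to a 4 shows how the same file lines up differently under a narrower tab stop.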

The notion of adjustable tab stops seems ideal for resolving arguments over indentation size: Use a tab character for each level of indent, then let anyone who works with the code use her own preferred tab stop interval on her own editor. If tab characters are used consistently for all indentation, then everyone gets to see the code her own way without imposing her view on anyone else.

But some programming languages and style guides require more sophisticated indentation behavior than tab stops allow. A rare situation might require a "half-tab", for which Emacs would use a mix of tabs and spaces to get the right effect. A half-tab in an editor configured with tab stops 4 characters wide would become 2 spaces in the file. When tabs and spaces are mixed in this way, erratic indentation results with other tab widths: One person's half-tab is another person's quarter-tab.

And not every application lets the user adjust how tabs are displayed, most notably web browsers, which are increasingly used to display source code. Code that looks reasonable with a 4-wide tab stop might look terrible (at least to someone who prefers 4-wide tabs) in Firefox, which uses an 8-wide tab stop. So a mix of tabs and spaces can anger the original author as well as her coworkers.

A file containing these characters:

my $h = { alpha => 1,
<TAB><TAB><SPC><SPC>beta => 2,
<TAB><TAB><SPC><SPC>gamma => 3
<TAB><TAB>};
if ($items->{beta} < 3) {
<TAB>do_something();
}

would look like this with a tab width of 4:

my $h = { alpha => 1,
          beta => 2,
          gamma => 3
        };
if ($items->{beta} < 3) {
    do_something();
}

and would look like this with a tab width of 8:

my $h = { alpha => 1,
                  beta => 2,
                  gamma => 3
                };
if ($items->{beta} < 3) {
        do_something();
}

Most groups I know have standardized on using only spaces, no tabs, for indentation, with the consequence that indentation widths must also be standardized. To configure Emacs to use only spaces for indentation, put this in your .emacs:

(setq-default indent-tabs-mode nil)

To convert a file that contains tabs and spaces to one that contains only spaces, select the region you want to convert (C-x h selects the entire buffer), then use the untabify command: M-x untabify
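
If you do this often, a small command can save the step of selecting the region. This is only a sketch; the name my-untabify-buffer is made up for illustration and is not part of Emacs:

```elisp
;; Hypothetical helper: convert every tab in the current buffer to spaces.
(defun my-untabify-buffer ()
  "Replace all tabs in the current buffer with spaces."
  (interactive)
  (untabify (point-min) (point-max)))
```

With this in your .emacs, M-x my-untabify-buffer converts the whole buffer regardless of where the point and mark are.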

Whether you prefer tabs or spaces, all of the other Emacs settings regarding indentation will do what you ask them to. If the indentation required for a line is 6 characters wide, your tab-width is 4, and your indent-tabs-mode is non-nil, Emacs will use 1 tab and 2 spaces to produce the desired effect. If your tab-width is 6, Emacs will use a single tab. If indent-tabs-mode is nil, Emacs will use 6 spaces.
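
The three cases above can be tried directly. This sketch uses indent-to, the function Emacs uses to insert indentation up to a column, inside a temporary buffer. The helper name my-indentation-string is an assumption for illustration:

```elisp
;; Hypothetical helper: return the whitespace Emacs would insert to
;; indent from column 0 to COLUMN, given a tab width and a tabs setting.
(defun my-indentation-string (column width use-tabs)
  (with-temp-buffer
    (setq tab-width width
          indent-tabs-mode use-tabs)  ; both are local to the temp buffer
    (indent-to column)
    (buffer-string)))

(my-indentation-string 6 4 t)    ; one tab followed by two spaces
(my-indentation-string 6 6 t)    ; a single tab
(my-indentation-string 6 4 nil)  ; six spaces
```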

See also: Tabs versus Spaces: An Eternal Holy War by Jamie Zawinski; Why I prefer no tabs in source code by Adam Spiers; Use Tabs in Source Code by John Levon; Why I love having tabs in source code by Charles Samuels.


setq-default

indent-tabs-mode is an example of a "buffer-local" variable: Its value can vary from buffer to buffer, and changing its value for one buffer does not change it for other buffers. setq-default establishes a global default value for the variable, which a buffer uses if it does not have its own value.

A variable becomes automatically buffer-local when it is declared that way (with make-variable-buffer-local), either by Emacs itself or in the Emacs Lisp code for a major mode. Once a variable is declared buffer-local in this way, changing its value with setq only changes the value for the current buffer.

You can usually determine if a variable is buffer-local (or file-local) by reading its documentation: C-h v indent-tabs-mode
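
To see the difference between the two forms, here is a small sketch you can evaluate one expression at a time. It only uses the variable discussed above and a throwaway buffer:

```elisp
;; Set the global default: buffers without their own value use this.
(setq-default indent-tabs-mode nil)

;; Inside one buffer, setq changes only that buffer's value.
(with-temp-buffer
  (setq indent-tabs-mode t)
  (list indent-tabs-mode                      ; this buffer's value: t
        (default-value 'indent-tabs-mode)))   ; the default is still nil

;; Ask whether setting the variable would make it buffer-local.
(local-variable-if-set-p 'indent-tabs-mode)
```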


Adjusting the Basic Indentation Width Used in Code

Many major programming languages use indentation to represent the block structure of the code: A function definition begins at one level of indentation, the body of that definition is indented one level in from that, the contents of an "if" statement in the body are indented one level in from that, and so on. Each level of indentation is typically the same width, such as 4 characters wide.

The C programming language uses this style of indentation, as do C++, Java, and ECMAScript (JavaScript). Because these languages all have similar indentation styles, they all share the code that implements indentation for c-mode (the CC Mode package). c-mode determines the basic indentation width from a variable called c-basic-offset, which you can adjust.

To set the default basic indentation width for all c-mode-derived languages, put a line like this in your .emacs file:

(setq-default c-basic-offset 4)

Language modes that use a basic offset calculate other indentation widths in terms of the basic offset. For example, if a situation calls for 3-and-a-half indent levels, the indentation width would be the basic offset times 3.5. This means you typically don't need to worry about those special cases. Simply set the basic offset according to your community's style guide, and the special cases will work as expected.
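
As I understand the CC Mode manual, these multiples are written with shorthand symbols: + means one basic offset, ++ means two, and * means half of one. The sketch below shows what adjusting a few special cases might look like. The hook function name and the particular choices are illustrative assumptions, not recommendations:

```elisp
;; Hypothetical hook: express special cases as multiples of c-basic-offset.
(defun my-c-offsets-hook ()
  ;; `+' is one basic offset: indent case labels inside a switch.
  (c-set-offset 'case-label '+)
  ;; `++' is two basic offsets: push continued argument lists in further.
  (c-set-offset 'arglist-cont-nonempty '++)
  ;; `*' is half a basic offset: nudge continued statements in slightly.
  (c-set-offset 'statement-cont '*))
(add-hook 'c-mode-common-hook 'my-c-offsets-hook)
```

Because every entry is relative to c-basic-offset, changing the basic offset later rescales all of them at once.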


Setting the Basic Indentation Width For a Specific Language

If you set the c-basic-offset variable with setq-default as above, you will set it for all languages that derive from c-mode. If you want to set it for just one language and not others, you can add a function to that mode's "hook," a list of functions that Emacs calls when the particular language mode activates. For example, to set the indentation width for just java-mode:

(defun my-java-mode-hook ()
  (setq c-basic-offset 4))
(add-hook 'java-mode-hook 'my-java-mode-hook)

Notice that this uses setq and not setq-default. This is because the setting ought to be local to the buffer that's activating the mode. If another buffer uses a different mode based on c-mode, it will use its own local value of c-basic-offset.

Because a mode hook runs after the mode has set up its own defaults, setting c-basic-offset (and other values) in a hook generally means your value wins over the mode's defaults and over other earlier attempts to set it.
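
The same pattern extends to several languages at once. A minimal sketch, assuming you want different widths for C and C++ (the function names and the widths are made up for illustration):

```elisp
;; Hypothetical per-language hooks, each setting a buffer-local width.
(defun my-c-mode-hook ()
  (setq c-basic-offset 8))
(defun my-c++-mode-hook ()
  (setq c-basic-offset 2))
(add-hook 'c-mode-hook 'my-c-mode-hook)
(add-hook 'c++-mode-hook 'my-c++-mode-hook)
```

A C buffer and a C++ buffer open side by side would then each keep their own value.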


Automatically Applying Indentation Rules To Code

For most languages, pressing the tab key on any line in a programming mode moves the current line to the appropriate level of indentation, and does nothing if it's already there. You can press it anywhere in the line, and it won't mess it up.

To apply the same effect to a region of lines, use the indent-region command, usually bound to C-M-\. As usual, to select a region, put the point (the cursor) on the first line, hit C-SPC, then move to the last line, then perform the desired command.

To fix indentation for the entire buffer, make the entire buffer the region, then do the command: M-< C-SPC M-> C-M-\

I can never remember the keybinding, so I just do this: M-< C-SPC M-> M-x indent-region
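
Another option is to wrap the whole sequence in a command of your own. This is a sketch, and my-indent-buffer is a made-up name rather than a built-in:

```elisp
;; Hypothetical helper: re-indent every line of the current buffer
;; according to the rules of its major mode.
(defun my-indent-buffer ()
  "Apply `indent-region' to the entire buffer."
  (interactive)
  (indent-region (point-min) (point-max)))
```

Calling indent-region with explicit positions leaves the point and mark where they were, which the keyboard sequence does not.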


Indentation in Python

Older Emacs releases do not include a major mode for Python source code (Emacs 22 and later bundle a python.el of their own, with its own settings). If you do anything with Python source files, get python-mode from the python-mode Sourceforge project site. Emacs with python-mode is one of the best ways to write and edit Python.

Python does not use C-style curly brackets to indicate block structure, and so python-mode does not derive from c-mode or use c-basic-offset. Instead, it has its own offset variable: py-indent-offset

(setq-default py-indent-offset 4)
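
If you would rather keep all of your Python settings in one place, a mode hook works here too. This sketch assumes python-mode.el as described above, with its py-indent-offset variable; the hook function name is hypothetical:

```elisp
;; Hypothetical hook for python-mode.el buffers.
(defun my-python-mode-hook ()
  (setq py-indent-offset 4)     ; indentation width used by python-mode.el
  (setq indent-tabs-mode nil))  ; spaces only, never tabs
(add-hook 'python-mode-hook 'my-python-mode-hook)
```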

In Python, indentation is syntactically significant: A line belongs to a block only if it is indented further than the line that opens the block (such as an if statement) and is indented the same as the other lines in the block. In contrast, C doesn't care how you indent the lines in the block (even though good practice is to indent them consistently), as long as the lines appear within curly brackets.

Python:

import sys

if age < 21:
    print("You can't come in here.")
    sys.exit(2)

print("Welcome to Charlie's.")

C, with typical indentation:

if (age < 21) {
    printf("You can't come in here.\n");
    exit(2);
}
printf("Welcome to Charlie's.\n");

C, with messed up indentation that's ignored by the compiler:

if (age < 21) {
   printf("You can't come in here.\n");
      exit(2);
 }
     printf("Welcome to Charlie's.\n");

Because indentation carries meaning in Python, python-mode cannot definitively determine the correct indentation for any given line. Just as it can't write your code for you, it can't indent your code for you either. You have to choose the level of indentation that means what you intend. python-mode is exceptionally helpful with managing indentation, but it inevitably behaves differently from c-mode.

For instance, in python-mode, there is no notion of fixing raggedly indented code with indent-region, because there is no way for Emacs to know for certain which indents are intentional just from looking at the code. (There is a py-indent-region to help with aligning a section of code with the rest of a block, but it doesn't "fix" indentation within the region as indent-region does for other languages.)

When you press the tab key in python-mode, the current line indents to be a part of the same block as any preceding lines, which may or may not be what you want. Press tab again and the line outdents one level, and so on cycling through the sensible possibilities for the line. python-mode also automatically adjusts indentation of new lines based on the contents of the previous line, which, combined with the tab key behavior, saves about as many keystrokes as is possible.


References

The GNU Emacs Manual: 31.3 Indentation for Programs describes the basic commands and settings for indenting computer source code.

The CC Mode Manual describes c-mode and derived modes. 4.1 Indentation Commands, 10. Indentation Basics, and 11. Customizing Indentation go into complete detail about the indentation engine, how to use it and how to change its behavior.

The best information about python-mode is in the mode's built-in documentation. Start at C-h f python-mode and look around. If you're adventurous, click on "python-mode.el" in python-mode's documentation to browse the source code and the documentation together.

Emacs has a million features for formatting plain text documents rendered in fixed width fonts, as most documents were back in The Olden Days. I can't imagine ever writing a substantial plain text document today, and so can't imagine using these features for much. For source code, I want the programming modes to handle all of the indentation for me, and the features described above generally suffice. GNU Emacs Manual: 29.1 Indentation Commands and Techniques describes a bunch of other workaday indentation commands that mostly don't apply to syntax-sensitive indentation.

comments...

To make the entire buffer the region you could also use: C-x h

ah, code indentation and wrapping. such a rathole, but so irresistible! i took a stab at some elisp that did this a while back: http://snarfed.org/space/fillcode

Nice article.

"M-< C-SPC M-> C-M-\" can be simplified to "C-x h C-M-\", :)
