The Open Source Perspective
Table of Contents

Author:Diomidis Spinellis
Publisher :Addison Wesley
ISBN :0-201-79940-5
Pages : 528

Chapter 1: Introduction

Software source code is the definitive medium for communicating a program's operation and for storing knowledge in an executable form.

Most programming courses and textbooks focus on how to write programs from scratch. However, 40% to 70% of the effort that goes into a software system is expended after the system is first written. That effort invariably involves reading, understanding, and modifying the original code. In addition, the unrelenting, inevitable accumulation of legacy code; the increasing emphasis placed on software reuse; the high human turnover rates associated with the software industry; and the rising importance of open-source development efforts and cooperative development processes (including outsourcing, code walkthroughs, and extreme programming) make code reading an essential skill for today's software engineer.

In this book you will learn how to

  • Be able to read and understand nontrivial software code
  • Appreciate many important software development concepts
  • Know how to explore large bodies of code
  • Have a reading capability of a multitude of important high- and low-level programming languages
  • Appreciate the intricacies of real software projects

Why and How to Read Code

You may find yourself reading code because you have to, such as when fixing, inspecting, or improving existing code.

Code as Literature

In your code-reading adventures you will inevitably encounter code that should best be treated as an example of practices to avoid. You should not, however, expect to learn sound programming from poorly written code.
Read code selectively and with a goal in your mind. Are you trying to learn new patterns, a coding style, a way to satisfy some requirements?

Code as Exemplar

The key concept when you are using code as exemplar is to be flexible. Be prepared to use a number of different strategies and approaches to understand how the code works.

If a strategy does not quickly produce the results you want, drop it and try something different. Remember, the code you are looking for is there; you just have to locate it.

Maintenance

In other cases code, rather than being an exemplar, may actually need fixing. If you think you have found a bug in a large system, you need strategies and tactics to let you read the code at increasing levels of detail until you have found the problem. The key concept in this case is to use tools.

Evolution

In most situations you will be reading code not to repair a fault but to add new functionality, modify its existing features, adapt it to new environments and requirements, or refactor it to enhance its nonfunctional qualities. The key concept in these cases is to be selective in the extent of the code you are examining; in most situations you will actually have to understand a very small percentage of the overall system's implementation.

Reuse

You might also find yourself reading code to look for elements to reuse. The key concept here is to limit your expectations.

Inspections

Finally, in some work settings, the task of code reading may be part of your job description. A number of software development methodologies use technical reviews such as walkthroughs, inspections, round-robin reviews, and other types of technical assessments as an integral part of the development process.

Here you need to be thorough. Examine code to uncover errors in function and logic.

How to Read this Book

We have arranged the material in an order that will let you progress from the basic to the more sophisticated elements.

Outline

In Chapter 2 we present two complete programs and examine their workings in a step-by-step fashion. In doing so we outline some basic strategies for code reading and identify common C control structures, building blocks, idioms, and pitfalls. We leave some more advanced (and easily abused) elements of the C language to be discussed in Chapters 3 and 5. Chapter 4 examines how to read code embodying common data structures. Chapter 6 deals with code found in really large projects: geographically distributed team efforts comprising thousands of files and millions of lines of code. Large projects typically adopt common coding standards and conventions (discussed in Chapter 7) and may include formal documentation (presented in Chapter 8). Chapter 9 provides background information and advice on viewing the forest rather than the trees: the system's architecture rather than its code details. When reading code you can use a number of tools. These are the subject of Chapter 10. Finally, Chapter 11 contains a complete worked-out example: the code-reading and code-understanding techniques presented in the rest of the book are applied for locating and extracting a phase of the moon algorithm from the NetBSD source code base and adding it as an SQL function in the Java-based HSQL database engine.

Chapter 2 Basic Programming Elements

A Complete Program

A very simple yet useful program available on Unix systems is echo, which prints its arguments on the standard output (typically the screen).
More than half of the program code consists of legal and administrative information such as copyrights, licensing information, and program version identifiers.

Functions and Global Variables

When examining a nontrivial program, it is useful to first identify its major constituent parts.
It is a good practice when inspecting code to ensure that all variables needed only in a single file are declared as static.
To understand what a function (or method) is doing you can employ one of the following strategies.

  • Guess, based on the function name.
  • Read the comment at the beginning of the function.
  • Examine how the function is used.
  • Read the code in the function body.
  • Consult external program documentation.

Gradual understanding: Understanding one part of the code can make others fall into place.

while Loops, Conditions, and Blocks

Make it a habit to read the documentation of library elements you encounter; it will enhance both your code-reading and code-writing skills.
The body of a while statement can be either a single statement or a block: one or more statements enclosed in braces. The same is true for all statements that control the program flow, namely, if, do, for, and switch. Programs typically indent lines to show the statements that form part of the control statement. However, the indentation is only a visual clue for the human program reader

switch Statements

When the code for a given case or default label does not end with a statement that transfers control out of the switch block (such as break, return, or continue), the program will continue to execute the statements following the next label. When examining code, look out for this error. In rare cases the programmer might actually want this behavior. To alert maintainers to that fact, it is common to mark these places with a comment, such as FALLTHROUGH.

for Loops

Typically a for loop is specified by an expression to be evaluated before the loop starts, an expression to be evaluated before each iteration to determine if the loop body will be entered, and an expression to be evaluated after the execution of the loop body. for loops are often used to execute a body of code a specific number of times.
In most cases an "infinite" loop is a way to express a loop whose exit condition(s) cannot be specified at its beginning or its end. These loops are typically exited either by a return statement that exits the function, a break statement that exits the loop body, or a call to exit or a similar function that exits the entire program.

break and continue Statements

A break statement will transfer the execution to the statement after the innermost loop or switch statement. A continue statement will reevaluate the conditional expression of while and do loops. In for loops it will evaluate the third expression and then the conditional expression.

To determine the effect of a break statement, start reading the program upward from break until you encounter the first while, for, do,or switch block that encloses the break statement. Locate the first statement after that loop; this will be the place where control will transfer when break is executed.

Similarly, when examining code that contains a continue statement, start reading the program upward from continue until you encounter the first while, for, or do loop that encloses the continue statement. Locate the last statement of that loop; immediately after it (but not outside the loop) will be the place where control will transfer when continue is executed. Note that continue ignores switch statements and that neither break nor continue affect the operation of if statements.

Character and Boolean Expressions

while ( '\'0' <= *cp && *cp <= '9')

This can then be read as a simple range membership test for a character c.

0 <= c <= 9

goto Statements

goto statements can be easily abused to create "spaghetti" code: code with a flow of control that is difficult to follow and figure out.

Refactoring in the Small

If you have control over a body of code, you can profit by reorganizing code sections to make them more readable. This improvement of the code's design after it has been written is termed refactoring.

do Loops and Integer Expressions

The body of a do loop is executed at least once.
The & operator performs a bitwise-and between its two operands

Control Structures Revisited

The first thing you should remember is to examine one control structure at a time, treating its contents as a black box. The beauty of structured programming is that the control structures employed allow you to abstract and selectively reason about parts of a program, without getting overwhelmed by the program's overall complexity.
When going over loop code, you may want to ensure that the code will perform according to its specification under all circumstances

A useful abstraction for reasoning about properties of loops is based around the notions of variants and invariants. A loop invariant is an assertion about the program state that is valid both at the beginning and at the end of a loop. By demonstrating that a particular loop maintains the invariant, and by choosing an invariant so that when the loop terminates it can be used to indicate that the desired result has been obtained, we can ensure that an algorithm's loop will work within the envelope of the correct algorithm results. Establishing this fact, however, is not enough. We also need to ensure that the loop will terminate. For this we use a variant, a measure indicating our distance from our final goal, which should be decreasing at every loop iteration. If we can demonstrate that a loop's operation decreases the variant while maintaining the invariant, we determine that the loop will terminate with the correct result.

Chapter 3 Advanced C Data Types

Pointers

Examples can be found in the pointers-c |pointers section.

Index and Pointer Code for Accessing an Array a with Elements of Type T with pointer p.

Array Index Code Pointer Code
int i; T *p;
i = 0; p = a or p = &a[0]
a[i] *p
a[i].f p->f
i++ p++
i+ = K p += K
i == N p == &a[N] or p == a + N

Structures

A C struct is a grouping of data elements that makes it possible to use them as a whole. Examples can be found in the Structures Section.

Unions

A C union groups together items that share the same storage area. Only one of the items sharing the area can be accessed at the time.

Dynamic Memory Allocation

A data structure whose size is not known when the program is written, or one that grows while the program is running, is stored in memory allocated dynamically while the program is running. Programs refer to dynamically allocated memory through the use of pointers.

typedef Declarations

A typedef declaration adds a new name, a synonym, for an already existing type.
More examples in typedef.

Chapter 4 C Data Structures

Vectors

Vectors are the most common data structures in C programs.
The typical way to process all elements of an array is with a for loop.

Matrices and Tables

You will typically find a two-dimensional data structure referred to as a table in data-processing contexts and as a matrix in mathematical contexts. The two structures are also differentiated by the fact that matrix elements are all of the same type, whereas table elements are in most cases of different types.

Stacks

The ways we can access a stack data structure are much more restricted than those offered by a vector. A stack only allows elements to be added (pushed) or removed (popped) from its end in a last-in-first-out (LIFO) order. In addition, we can query the stack to find out if it is empty.

Queues

A queue is a collection of elements that allows items to be added at the end (the tail) and removed from the beginning (the head)ina first-in-first-out (FIFO) order. You will often find queues used in places where two systems are connected.

Maps

When we use an array index variable to access array elements, we perform a very efficient operation, typically implemented with one or two machine instructions. This feature makes arrays ideal for organizing simple maps or lookup tables keyed by sequential integers starting from zero. In cases where the array elements are known in advance, they can be used to directly initialize the array contents.

Sets

There are cases where we want to efficiently represent and process sets of elements. When these elements can be expressed as relatively small integers, you will find that the typical implementation used involves representing the set as an array of bits, with set membership of each element based on the value of a particular bit. The C language does not have a data type for directly representing and addressing bits as arrays.

Linked Lists

The simplest and most common linked data structure you will encounter in C programs is a linked list. It is constructed by joining together, through a pointer, structures representing the list elements. Elements are typically added to the front of the list—the list head. However, because all list elements are linked together using pointers, elements can be efficiently added or removed from any list position, an operation that may require a significant amount of overhead in large arrays. To locate a particular item in the list, the list must be traversed from its beginning; it is not possible to randomly access elements in it. You will find lists often used to store sequentially accessed data that expands dynamically during the program's operation.

Trees

A number of algorithms and methods of organizing information rely on trees as the underlying data structure. The formal definition of a tree states that its nodes are joined by edges in a way that there is exactly one path from each node to the tree's root. The way a tree's nodes expand at each level is often exploited to efficiently organize and process data.

Graphs

A graph is defined as a set of vertices (or nodes) joined by edges.

Chapter 5 Advanced Control Flow

Recursion

A recursive definition of an entity or an operation defines its object in terms of itself.

Exceptions

An exception mechanism allows programmers to isolate the handling of errors from the code's normal flow of control. You will encounter similar constructs in both C++ and Java programs; some of the errors handled in these languages via exceptions are reported to C programs by signals.

Parallelism

Some programs execute part of their code in parallel to enhance responsiveness to their environment, to structure work allocation, or to effectively use multiple computers or computers with multiple processors.

Signals

Application-level C programs receive external asynchronous events in the form of signals. Examples of signals include a user interrupting a process (SIGINT), the execution of an illegal instruction or an invalid memory reference (SIGILL, SIGSEGV), a floating-point exception (SIGFPE), or the resumption of a stopped process (SIG-CONT). Code responding to signals consists of a signal handler: a function specified to be executed when a given signal is received by the application.

Nonlocal Jumps

The C library setjmp and longjmp functions allow programs to jump back from a function without having to return from each function in the sequence. The functions are typically used to immediately exit from a deeply nested function, often as a response to a signal. The functions are designed to be used together. At the point where nested function calls will return, a setjmp is used to store in a buffer variable the environment's context. From then onward until the function that called setjmp exits, calls to longjmp, with a pointer to the saved environment as an argument, will return execution to the point where setjmp was originally called. To distinguish between the original setjmp call and subsequent returns to it via longjmp, setjmp returns 0 in the first case and a nonzero value in the latter one.

Macro Substitution

The flexibility of the C preprocessor is often used to define simple functions as macros. When reading code that contains macros, keep in mind that macros are neither functions nor statements. Careful coding practices need to be employed when macros are defined to avoid some common pitfalls.

Chapter 6 Tackling Large Projects

Design and Implementation Techniques

Visible Software Process and Disciplined Practices

In large projects, elements of the software lifecycle that are implicitly handled in smaller undertakings are often part of the software code base.

Nontrivial Architecture

Large projects are built using an appropriate architecture that tames its complexity. This architecture typically dictates the system's structure, the way control is handled, and the modular decomposition of the system's elements.

Merciless Decomposition

Decomposing mechanisms: functions, objects, abstract data types, and components.

Support for Multiple Platforms

Large applications often need to address portability concerns.

Object-Oriented Techniques

Objects are organized to hierarchies. Inheritance and dynamic dispatch techniques are then used.

Operator Overloading

Operator Overloading to promote project-specific data types into first-class citizens that are then manipulated using the language's built-in operator set.

Libraries, Components, and Processes

At a higher level of granularity, the code of large systems is often decomposed into libraries of object modules, reusable components, and even separate processes.

Domain-Specific and Custom Languages and Tools

Large coding efforts often involve the creation of specialized tools or benefit from existing ones used in similar endeavors. Examples include javadoc, makefile .

Aggressive Use of Preprocessing

Projects implemented in assembly language, C, and C++ often use a preprocessor to extend the language with domain-specific structures. Modern projects are rarely coded in symbolic language.

Project Organization

You can examine a project's organization by browsing its source code tree. A common strategy for developing multiplatform applications is to isolate the platform-specific code customizing it for each platform. This code can then be put into separate directories.
When you work on a large project for the first time, spend some time acquainting yourself with its directory tree structure.

When browsing a large project, keep in mind that a project's "source code" encompasses a lot more than the computer language instructions that are compiled to obtain an executable program; a project's source tree typically also includes specifications, end-user and developer documentation, test scripts, multimedia resources, build tools, examples, localization files, revision history, installation procedures, and licensing information.

The Build Process and Makefiles

Steps of a typical build process

  • Configure: Involves configuring software options and determining what exactly you want to build.
  • Create Dependencies: Based on the project's configuration we can create a dependency graph. The dependency graph specifies the correct order in which various project components are to be built.
  • Build Tools: Some parts of the build process are not performed by using standard development tools such as the compiler or linker but require tools that have been developed for the specific project.
  • Build Executables: After the project-specific tools are built they can be used, together with other standard tools, to preprocess, compile, and link the project's executable files.
  • Build documentation: As a parallel step, the project's documentation is often converted from a "source" format into the final distribution format.
  • Install Deploy: Finally, the resultant executables and documentation are installed on the target system or prepared for larger-scale deployment or distribution.

Configuration

A system's configuration can occur either at buildtime, in which case it affects both the build process and its results, or dynamically at runtime, in which case it affects only the program's operation.

Configuration data read at runtime can be freely manipulated by the end user. Programs should therefore validate it with the same care as any other user input, while programs running under an elevated security privilege should ensure that they cannot have any of their functions configured by unprivileged users.

Revision Control

A revision control system allows you to view and control this time element by tracking the code's evolution, tagging important occasions, and chronicling the rationale behind changes. In this section we detail how to use a revision control system to comprehend the code's temporal elements.

Project-Specific Tools

Large projects often have unique problems and matching resources to construct specialized tools as part of the implementation process. Custom-built tools are used for many different aspects of the software development process including configuration, build process management,code generation,testing,and documentation.

Testing

As part of the source code reading activity, you should first be able to recognize and reason about test code and test cases, and then use testing artifacts as an aid for understanding the rest of the code.

The simplest type of code used for testing is a statement that generates logging or debugging output. Such statements are typically used by developers to test whether the program's operation agrees with their expectations.
In Unix systems the syslog library function is used to append the information into the appropriate system log.
Java programs often use the freely available log4j library to organize and manage the efficient generation of logging messages.

A more proactive approach has the program code test various conditions that should be true while the program is executing. These conditions, called assertions, are essentially statements specifying an expected outcome if the program works in the way its developer thought it would.

Chapter 7 Coding Standards and Conventions

Significant coding efforts, or projects undertaken within the framework of large organized endeavors such as GNU and BSD, are likely to adopt a set of coding standards, guidelines, or conventions.

File Names and Organization

Most standards specify how files should be named and what extensions are to be used.

Indentation

Programs written in modern block structured languages use indentation to emphasize the nesting level of each block. Style guides often specify the amount and type of whitespace to use for indenting blocks.

Formatting

All coding standards specify the way declarations and each specific statement are to be formatted.

Naming Conventions

Three different ways for composing variable, class, and function identifiers.

  • Use of capitalization: Each new word in an identifier is capitalized.
  • Separation of words by underscores: Each additional word in an identifier is separated by the previous one by an underscore.
  • Initials. Under this scheme the initials of each word are merged to construct a new identifier.

Programming Practices

Many coding guidelines contain explicit rules for writing portable software.

Coding guidelines often also specify how applications are to be coded to ensure a correct or consistent user interface.

Process Standards

A software system is more than its code elements. Consequently many coding guidelines diverge into other fields of the development process, including documentation and the organization of the build and release process.

Chapter 8 Documentation

Documentation Types

A traditionally engineered project will generate over its development process a number of different documents.

The system specification document details the objectives of the system, its functional requirements, management and technical constraints, and cost and schedule parameters.

The software requirements specification provides a high-level description of the user requirements and the overall system architecture, detailing the functional and nonfunctional requirements such as processing, external interfaces, logical database schemas, and design constraints.

In the design specification you will find the description of the system's architecture, data, and code structures, as well as the interfaces between the different modules.

A system's test specification contains a test plan, specific test procedures, and actual test results.

A user documentation comprises a number of different documents, including a functional description, installation instructions, an introductory guide, a reference manual, and an administrator manual.

Reading Documentation

Documentation provides you with a shortcut for obtaining an overview of the system or for understanding the code that provides a particular feature.

Documentation Problems

When reading documentation, keep in mind that documentation may often provide a misleading view of the source code. The two different cases of misrepresentation you will encounter are undocumented features and an idealized presentation

Additional Documentation Sources

When looking for source code documentation, consider nontraditional sources such as comments, standards, publications, test cases, mailing lists, newsgroups, revision logs, issue-tracking databases, marketing material, and the source code itself.

Common Open-Source Documentation Formats

troff

The troff document formatting system is typically used in conjunction with the traditional man or the new BSD mdoc macros to create Unix manual pages. Other descriptive parts of the Unix manuals are also typeset in troff.

Texinfo

The Texinfo macros are processed by the TeX document formatting system. They are used to create on-line and printed documentation for many projects implemented under the GNU effort.

DocBook

DocBook is an XML/SGML application adopted, among other uses, in the Free BSD documentation project.

javadoc

The javadoc application will process Java source code suitably annotated with specially marked comments to generate documentation in a number of different formats.

Doxygen

Doxygen is a documentation system for source code written in C++, Java, IDL, and C.

Chapter 9 Architecture

System Structures

Centralized Repository and Distributed Approaches

Data-Flow Architectures

Object-Oriented Structures

Layered Architectures

Hierarchies

Slicing

Control Models

Event-Driven Systems

System Manager

State Transition

Element Packaging

Modules

Namespaces

Objects

Abstract Data Types

Libraries

Processes and Filters

Data Repositories

Architecture Reuse

Frameworks

Code Wizards

Design Patterns

Domain-Specific Architectures

Chapter 10 Code-Reading Tools

Regular Expressions

The Editor as a Code Browser

Code Searching with grep

Locating File Differences

Roll Your Own Tool

The Compiler as a Code-Reading Tool

Code Browsers and Beautifiers

Runtime Tools

Nonsoftware Tools

Chapter 11 A Complete Example

Goal: The enhancement of the hsqldb database to natively support a new SQL date/time function.
Function PHASEOFMOON: returns the moon's phase on a given date as a number in the range 0–100, with 0 representing a new moon and 100 a full moon.

Overview

I will first examine the system's top-level directory. Then compile the source code and run it.

Attack Plan

I will first search in the SQL documentation for functions with attributes and types similar to those of the function I wish to add. I will then model my function following the structure of those existing functions. To be able to efficiently use plain text-matching for searching through the code, I select a function with an uncommon name (DAYOFWEEK) as opposed to functions with names such as YEAR and HOUR.

I then examine this function and all the functions that it uses. I realize that I cannot use a library method so I will have to calculate the phase of the moon.

Code Reuse

Transform C code to Java code.

Testing and Debugging

Documentation

Need to update the system's documentation.

Observations

Typical software evolution activities will require the understanding of code for a number of different purposes.
In most cases, we simply cannot afford to read and understand the code of an entire software system. Any attempt to precisely analyze code will typically branch into numerous other classes, files, and modules and quickly overwhelm us; we therefore actively try to limit the extent of code we have to understand to the absolutely necessary minimum. Instead of striving for a global and complete understanding of the code, we try to find heuristic shortcuts and employ the build process and the execution of the running system to direct us to the code areas that require our attention.
. To escape, we employ a breadth-first search strategy, attacking code-reading problems from multiple sides until one of them gives way.

Unless otherwise stated, the content of this page is licensed under Creative Commons Attribution-Share Alike 2.5 License.