
PDD: Conventions and Guidelines for Perl Source Code

From: Dave Mitchell
Date: May 8, 2001 07:23
Subject: PDD: Conventions and Guidelines for Perl Source Code
Message ID: 200105081345.OAA09346@gizmo.fdgroup.co.uk

Here's something I promised ages ago, but life got in the way:
first draft of the PDD on all things related to coding: comments,
code structure, naming conventions etc etc. Not as exciting as a Schwartzian
Transform perhaps, but still needs discussion.

I have included comments on bits I'm not sure about - these are marked
by a bar(|) in column 0. I went to town a bit on naming conventions - these
very much reflect my own preferences and biases - folks who are actually
perl src developers (as opposed to pontificators like myself) may wish
to rein me in a bit.

The sections on portability and extensibility need fleshing out, as
these are not my strong point.

I haven't bothered prettifying the spelling, punctuation and grammar yet,
as I'm assuming revisions will have to be made first.



=head1 TITLE

Conventions and Guidelines for Perl Source Code

=head1 VERSION

=head2 CURRENT

   Maintainer: Dave Mitchell <davem@fdgroup.com>
   Class: Internals
   PDD Number: TBD 
   Version: 1
   Status: Proposed
   Last Modified: 7 May 2001
   PDD Format: 1
   Language: English

=head2 HISTORY

Based on an earlier draft which covered only code comments.

=head1 CHANGES

None. First version

=head1 ABSTRACT

This document describes the various rules, guidelines and advice for those
wishing to contribute to the source code of Perl, in such areas as
code structure, naming conventions, comments etc.

=head1 DESCRIPTION

One of the criticisms of Perl 5 is that its source code is impenetrable
to newcomers, due to such things as inconsistent or obscure variable
naming conventions, lack of comments in the source code, and so on.
Hence this document.

We define three classes of conventions. Those that say
I<must> are mandatory, and code will not be accepted (apart from in
exceptional circumstances) unless it follows these rules. Those
that say I<should> are strong guidelines that should normally be
followed unless there is a sensible reason to do otherwise.
Finally, where it says I<may>, this is a tentative suggestion to be used
at your discretion.

Note that this particular PDD makes some recommendations that are specific
to the C programming language. This does not preclude Perl being
implemented in other languages, but in this case,
additional PDDs may need to be authored for the extra language-specific
features.

=head1 IMPLEMENTATION

=head2 Coding style

The following I<must> apply:

| this section mostly stolen from Porting/patching.pod

=over 4

=item *

8-wide tabs

=item *

4-wide indents for code, 2-wide indents for nested CPP #directives

=item *

ANSI C function prototypes

=item *

| anyone know precisely what the following means?

"K&R" style for indenting control constructs

=item *

Uncuddled elses: ie avoid  C<} else {>

=item *

No C++ style comments (C<//>): some C compilers may choke on them

=item *

Mark places that need to be revisited with XXX and revisit often!

=item *

When a conditional spans multiple lines, the opening brace must line up
with the "if" or "while", or be at the end-of-line otherwise.

=item *

In function definitions, the name starts in column 0, with the
return type on the previous line

=item *

Single space after keywords that are followed by parens, eg
C<return (x+y)*2>, but no space between function name and following paren,
eg C<z = foo(x+y)*2>

=back
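
As a rough illustration of how the mandatory rules above combine, here is a
minimal sketch. The function C<count_set_bits()> is invented for this example
and is not part of any Perl API, and the brace placement is only one reading
of the rules, not a ruling on what "K&R" style means.

```c
/* Hypothetical example: count_set_bits() is not part of any Perl API. */

#include <stddef.h>

/* Return type on its own line; function name starts in column 0. */
int
count_set_bits(const unsigned char *buf, size_t len)
{
    int total = 0;
    size_t i;

    for (i = 0; i < len; i++) {     /* single space after the keyword */
        unsigned char byte = buf[i];

        while (byte) {
            total += byte & 1;
            byte >>= 1;
        }
    }

    if (total > 0) {
        return total;
    }
    else {                          /* uncuddled else, not "} else {" */
        /* XXX should an all-zero buffer be reported to the caller? */
        return 0;
    }
}
```

Note the ANSI prototype-style parameter list, the C<XXX> marker on the open
question, and the absence of C<//> comments.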

The following I<should> apply:

=over 4

=item *

Do not exceed 79 columns

=item *

C<return foo;> rather than C<return (foo);>

=item *

C<if (!foo) ...> rather than C<if (foo == FALSE) ...> etc.

=item *

Avoid assignments in conditionals, but if they're unavoidable, use
extra parens, eg C<if (a && (b = c)) ...>

=item *

Avoid double negatives, eg C<#ifndef NO_FEATURE_FOO>

=back
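
A short sketch of the recommended forms follows. Both functions are
hypothetical and exist only to show the idioms side by side: the bare
C<return>, the C<!foo> test, and the extra parens around an assignment
inside a conditional.

```c
#include <stddef.h>

/* Hypothetical helper, invented for this illustration. */
static const char *
skip_spaces(const char *s)
{
    while (*s == ' ')
        s++;
    return s;                       /* not: return (s); */
}

/* Return the length of the first word in s, or 0 if there is none. */
size_t
first_word_length(const char *s)
{
    const char *start;
    const char *p;

    /* assignment in a conditional: wrapped in extra parens */
    if (s && (start = skip_spaces(s))) {
        if (!*start)                /* not: if (*start == 0) ... */
            return 0;
        for (p = start; *p && *p != ' '; p++)
            ;
        return (size_t)(p - start);
    }
    return 0;
}
```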


=head2 Naming conventions

=over 4

=item Subsystems and APIs

Perl core will be split into a number of subsystems, each with an
associated API. For the purposes of naming files, data structures, etc,
each subsystem will be assigned a short nickname, eg pmc, gc, io.
All code within the core will belong to a subsystem; miscellaneous code
with no obvious home will be placed in the special subsystem called
misc.

=item Filenames

| I'm not familiar with what restrictions other OSes (VMS etc) may place
| on filenames. I have written most of what follows based on what
| appears to be convention in the current Perl 5 src tree.
| In particular, it might be tidier to put all files associated with
| a particular subsystem in their own subdirectory, (eg pmc/foo.h rather
| than pmc_foo.h) - but since perl5 has all its main code in
| a single directory, I'm vaguely assuming there are good portability
| reasons not to do so.

Filenames must be assumed to be case-insensitive, in the sense that
you may not have two different files called Foo and foo. Normal source-code
filenames should be all lower-case; filenames with upper-case letters
in them are reserved for notice-me-first files such as README, and for
files which need some sort of pre-processing applied to them or which
do the preprocessing - eg a script F<foo.SH> might read F<foo.TEMPLATE>
and output F<foo.c>.

The characters making up filenames must be chosen from the ASCII set
A-Z,a-z,0-9 plus .-_

An underscore should be used to separate words rather than a hyphen (-).
A file should not normally have more than a single '.' in it, and this
should be used to denote a suffix of some description.

Each subsystem I<foo> should supply the following files. (This arrangement is
based on the assumption that each subsystem will (as far as is practical)
present an opaque interface to all other subsystems within the core, as well as
to extensions and embeddings.)

=over 4

=item foo.h

This contains all the declarations needed for external users of that
API (and nothing more), ie it defines the API. It is permissible for the
API to include different or extra functionality when used by other parts of
the core, compared with its use in extensions and embeddings.
In this case, the extra stuff within the file is enabled by testing for
the macro PERL_IN_CORE.

=item foo_private.h

This contains declarations used internally by that subsystem, and which must
only be included within source files associated with the subsystem. This file
defines the macro PERL_IN_FOO so that code knows when it is being used within
that subsystem. The file will also contain all the 'convenience' macros used to
define shorter working names for functions without the perl prefix
(see below).

=item foo_globals.h

This file contains the declaration of a single structure containing the
private global variables used by the subsystem (see the section on globals
below for more details).

=item foo.sym

This file
| (format and contents TBD)
contains information about global symbols associated with the
subsystem, and may be used by scripts to auto-generate such stuff as
the include files mentioned above, linker map tables, documentation
etc, based upon portability and extensibility requirements.

=item foo_bar.[ch] etc

All other source files associated with the subsystem will have the prefix
foo_

=back
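
To make the split between the public and private headers more concrete, here
is a single-file sketch with banner comments showing where each piece might
live. Everything in it is an assumption made for illustration: the
C<Foo_widget> type, its functions, and the C<PERL_FOO_H> include guard are
invented, and F<foo.sym> and F<foo_globals.h> are left out.

```c
/* ---- foo.h: public API of the hypothetical subsystem "foo" ---- */
#ifndef PERL_FOO_H
#define PERL_FOO_H

typedef struct foo_widget Foo_widget;   /* opaque to users of the API */

Foo_widget *perlfoo_widget_new(int size);
int         perlfoo_widget_size(const Foo_widget *w);
void        perlfoo_widget_free(Foo_widget *w);

#ifdef PERL_IN_CORE
/* extra functionality offered only to other parts of the core */
void        perlfoo_widget_resize(Foo_widget *w, int size);
#endif

#endif /* PERL_FOO_H */

/* ---- foo_private.h: included only by files inside subsystem foo ---- */
#define PERL_IN_FOO

struct foo_widget {
    int fw_size;
};

/* convenience macros: shorter working names without the perl prefix */
#define foo_widget_new   perlfoo_widget_new
#define foo_widget_size  perlfoo_widget_size
#define foo_widget_free  perlfoo_widget_free

/* ---- foo_widget.c: one of the foo_bar.[ch] source files ---- */
#include <stdlib.h>

Foo_widget *
foo_widget_new(int size)
{
    Foo_widget *w = malloc(sizeof *w);

    if (w)
        w->fw_size = size;
    return w;
}

int
foo_widget_size(const Foo_widget *w)
{
    return w->fw_size;
}

void
foo_widget_free(Foo_widget *w)
{
    free(w);
}
```

Code outside the subsystem sees only the C<perlfoo_> names and the opaque
type; inside the subsystem the convenience macros let the same functions be
written and called by their shorter names.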

| the following is off the top of my head and represents the idea
| that the perl src code shouldn't be all dumped directly into the
| top-level directory of the tarball. Suggestions of particular
| directories are tentative to say the least

The top-level structure of the Perl source tarball should be as follows:

    /README,etc     a few top-level documents
    /doc/           Assorted miscellaneous documentation
    /pdd/           The current PDDs
    /perl/          The source code for Perl itself
    /perl/os/foo/   OS-specific source code for operating system foo
    /foo/           The source code for other families of binaries (eg /x2p/)
    /hints/         per-OS build hints files
    /scripts/       scripts needed during the building process
    /t/             scripts used by make test
    /lib/           perl modules ready for installation
    /ext/           perl modules that need compiling
    /pod/           src of the Perl man pages etc

plus others as it becomes necessary.

=item Names of code entities

Code entities such as variables, functions, macros etc (apart from strictly
local ones) should all follow these general guidelines.

=over 4

=item *

Multiple words or components should be separated with underscores rather
than using tricks such as capitalisation, eg C<new_foo_bar> rather than
C<NewFooBar> or (gasp) C<newfoobar>.

=item *

The names of entities should err on the side of verbosity, eg
C<create_foo_from_bar()> in preference to C<ct_foo_bar()>. Avoid cryptic
abbreviations wherever possible.

=item *

All entities should be prefixed with the name of the subsystem they appear
in, eg C<pmc_foo()>, C<struct io_bar>. They should be further prefixed
with the word 'perl' if they have external visibility or linkage,
namely, non-static functions, plus macros and typedefs etc which appear
in public header files. (Global variables are handled specially; see below.)

In the specific case of the use of global variables and functions within a
subsystem, convenience macros will be defined (in foo_private.h) that allow use
of the shortened name in the case of functions (ie C<pmc_foo()> instead of
C<perlpmc_foo()>), and hide the real representation in the case of global
variables.

=item *

Variables and structure names should be all lower-case, eg C<pmc_foo>.

=item *

Structure elements should be all lower-case, and the first component of
the name should incorporate the structure's name or an abbreviation of it.

=item *

Typedef names should be lower-case except the first letter, eg
C<Foo_bar>. The exception to this is when the first component is a short
abbreviation, in which case the whole first component may be made
uppercase for readability purposes, eg C<IO_foo> rather than C<Io_foo>.
Structures should generally be typedefed.

| An alternative is to suffix types with _t

=item *

Macros should have their first component uppercase, and the majority
of the remaining components should be likewise. Where there is a family
of macros, the variable part can be indicated in lowercase, eg
C<PMC_foo_FLAG>, C<PMC_bar_FLAG>, ....

| these next few bits may seem excessively detailed, but given how
| easily macros make code unreadable, I think it will be helpful
| to have some basic rules which relate the name of macro to its
| purpose, eg bit-testing vs feature tests etc.

=item *

A macro which defines a flag bit should be suffixed with C<_FLAG>, eg
C<PMC_readonly_FLAG>

=item *

A macro which tests a flag bit should be suffixed with C<_TEST>, eg
C<PMC_readonly_TEST>

=item *

A macro defining a mask of flag bits should be suffixed with C<_MASK>,
eg C<PMC_STATUS_MASK>

=item *

A macro defining an auto-configuration value should be prefixed with C<HAS_>,
eg C<HAS_BROKEN_FLOCK>, C<HAS_EBCDIC>.

=item *

A macro indicating the compilation 'location' should be prefixed with C<IN_>,
eg C<PERL_IN_CORE>, C<PERL_IN_PMC>, C<PERL_IN_X2P>.

=item *

A macro indicating major compilation switches should be prefixed with
C<USE_>, eg C<PERL_USE_STDIO>, C<USE_MULTIPLICITY>.

=item *

| my personal pet peeve: death to dSP and friends !!

Macros must never define or implicitly use auto variables unless it
is essential for extensibility. In this case, defining macros should
be prefixed with C<DEFVAR_>, and macros which use said variables should
be prefixed with C<VAR_>, eg

	#define DEFVAR_save_stack	struct Stack *oldsp = sp;
	#define VAR_restore_stack	sp = oldsp;

This then at least provides some warning to the programmer that things
are being done behind his/her/its back.

| further suggestions welcome

=back
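
The naming rules for structures, typedefs and flag macros might look like
this when applied together. This is only a sketch: C<PMC_header>, its
C<pmc_hdr_flags> element, C<PMC_tainted_FLAG> and the function name are
invented here; only C<PMC_readonly_FLAG>, C<PMC_readonly_TEST> and
C<PMC_STATUS_MASK> come from the examples above.

```c
/* Hypothetical names throughout; none of these exist in Perl. */

/* struct name all lower-case, typedef with a short upper-case prefix */
typedef struct pmc_header {
    unsigned int pmc_hdr_flags;     /* element name echoes the struct name */
} PMC_header;

/* flag bits: suffix _FLAG, variable part in lower case */
#define PMC_readonly_FLAG   0x01u
#define PMC_tainted_FLAG    0x02u

/* mask of flag bits: suffix _MASK */
#define PMC_STATUS_MASK     (PMC_readonly_FLAG | PMC_tainted_FLAG)

/* flag tests: suffix _TEST */
#define PMC_readonly_TEST(p)  (((p)->pmc_hdr_flags & PMC_readonly_FLAG) != 0)
#define PMC_tainted_TEST(p)   (((p)->pmc_hdr_flags & PMC_tainted_FLAG) != 0)

/* Verbose, underscore-separated name with perl and subsystem prefixes. */
void
perlpmc_mark_header_readonly(PMC_header *header)
{
    header->pmc_hdr_flags |= PMC_readonly_FLAG;
}
```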

=item Global Variables

| What follows is one suggestion for the handling of global variables,
| which allows each subsystem to declare its own variables, but which
| allows for easy extensibility in terms of per-thread globals etc.
| You may want to pick holes in it...

Global variables must never be accessed directly outside the subsystem
in which they are used. Some other method, such as accessor functions,
must be provided by that subsystem's API. (For efficiency the 'accessor
functions' may occasionally actually be macros, but then the rule still
applies in spirit at least).

All global variables needed for the internal use of a particular subsystem
should all be declared within a single struct called foo_globals for subsystem
foo. This structure's declaration is placed in the file foo_globals.h. Then
somewhere a single compound structure will be declared which has as members
the individual structures from each subsystem. Instances of this structure are
then defined as a one-off global variable, or as per-thread instances, or
whatever is required.

Within an individual subsystem, macros are defined for each global
variable of the form GLOBAL_foo (the name being deliberately clunky).
So we might for example have the following macros:

	/* perl_core.h or similar */

	#ifdef HAS_THREADS
	#  define GLOBALS_BASE (aTHX_->globals)
	#else
	#  define GLOBALS_BASE (Perl_globals)
	#endif

	/* pmc_private.h */

	#define GLOBAL_foo   GLOBALS_BASE.pmc.foo
	#define GLOBAL_bar   GLOBALS_BASE.pmc.bar
	... etc ...

=back
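
The global variable scheme could be fleshed out roughly as follows. This is
a sketch under several assumptions: the member names, the C<gc> subsystem's
contents, the accessor functions, and in particular the threaded branch
(C<perl_thread_context> and C<perl_current_context()>) are all invented to
make the example compile, and are not a proposal for how per-thread state
should really be reached.

```c
/* Hypothetical sketch: all member and function names are invented. */

/* pmc_globals.h */
struct pmc_globals {
    int pmc_live_count;
};

/* gc_globals.h */
struct gc_globals {
    int gc_run_count;
};

/* perl_core.h or similar: one compound structure for all subsystems */
struct perl_globals {
    struct pmc_globals pmc;
    struct gc_globals  gc;
};

#ifdef HAS_THREADS
/* assumed here: each thread has a context holding its own globals */
struct perl_thread_context {
    struct perl_globals globals;
};
extern struct perl_thread_context *perl_current_context(void);
#  define GLOBALS_BASE (perl_current_context()->globals)
#else
struct perl_globals Perl_globals;   /* the one-off global instance */
#  define GLOBALS_BASE (Perl_globals)
#endif

/* pmc_private.h */
#define GLOBAL_live_count   GLOBALS_BASE.pmc.pmc_live_count

/* pmc.c: accessor functions are the only way in from outside */
int
perlpmc_live_count(void)
{
    return GLOBAL_live_count;
}

void
perlpmc_note_allocation(void)
{
    GLOBAL_live_count++;
}
```

Code in the subsystem never mentions C<Perl_globals> directly, so switching
between a single global instance and per-thread instances only means
changing the definition of C<GLOBALS_BASE>.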


=head2 Code comments

The importance of good code documentation cannot be stressed enough.
To make your code understandable by others (and indeed by yourself
when you come to make changes a year later :-), the following conventions
apply to all source files.

=over 4

=item Developer files

| I'm not hung up on the suffix .dev below.
| Any alternative suggestions welcome

For each source file (eg a F<foo.c> F<foo.h> pair), there should be
an accompanying developer file called F<foo.dev>. This text file contains
documentation on all the implementation decisions associated with the
source file. (Note that this is in contrast to PDDs, which describe
design decisions). This is the place for mini-essays on
how to avoid overflows in unsigned arithmetic, or on the pros and cons of
differing hash algorithms, and why the current one was chosen,
and how it works. In principle, someone coming to a particular source file
for the first time should be able to read the F<.dev> file and gain an
immediate overview of what the source file is for, the algorithms it
implements, etc.

Currently no particular format or structure is imposed on the developer file,
| (mainly because I can't think of one! I don't see any particular
| need for pod here)
but it should have as a minimum the following sections:

=over 4

=item Overview

Explain the purpose of the source file.

=item Data structures and algorithms

Explain how it all works.

=item History

Record major changes to the file, eg "we moved from a linked list
to a hash table implementation for storing Foos, as it was found to be
much faster".

=item Notes

Anything that may be of interest to your successors, eg benchmarks
of differing hash algorithms, essays on how to do integer arithmetic.

=item References

Links to pages and books that may contain useful info relevant to the
stuff going on in the code - eg the book you stole the hash function from.

=back

=item Top-of-file comments

In addition to the copyright message and optional quote, each source file must
have a short comment at the top explaining the basic purpose of the file, eg

	/* pp_hot.c - like pp.c, this file contains functions that operate
	 * on the contents of the stack (pp == 'push & pop'), but in this
	 * case, frequently used ('hot') functions have been moved here
	 * from pp.c to (hopefully) improve CPU cache hit rates.
	 */

=item Per-section comments

If there is a collection of functions, structures or whatever which
are grouped together and have a common theme or purpose, there should
be a general comment at the start of the section briefly explaining
their overall purpose. (Detailed essays should be left to the developer file).
If there is really only one section, then the top-of-file comment
already satisfies this requirement.

	/* This section deals with 'arenas', which are chunks of PMCs of
	 * a particular type that are allocated in one go. Individual
	 * requests can then be made to grab or release individual PMCs.
	 * For each type foo, there is a pointer called GLOBAL_arena_foo
	 * which blah blah....
	 */

=item Per-entity comments

Every non-local named entity, be it a function, variable, structure, macro
or whatever, must have an accompanying comment explaining its purpose.
This comment must be in the special format described below, in order
to allow automatic extraction by tools - for example, to generate
per-API man pages, B<perldoc -f> style utilities and so on.

Often the comment need only be a single line explaining its purpose,
but sometimes more explanation may be needed. For example,
C<< /* return an Integer Foo to its allocation pool */ >> may be enough to
demystify the function C<del_I_foo()>.

| Should structure elements be individually commented and extractable
| too - or is that just getting silly???
|
| At this point I confess to being slightly confused by the current
| embed.pl system for embedding structured info about functions etc
| in the src code as well as at the end of embed.pl itself, along
| with the auto-generation of headers etc. Thus I don't feel qualified
| to make too detailed suggestions about what appears on the same line
| as the /*=for.

Each comment should be of the form

    /*=for api apiname entityname[,entityname..] flags ....(TBC)....
    comments....
    */

where I<apiname> is the API the entity belongs to, eg I<pmc>, and entity
name is the actual name of the function or macro or whatever. Where
there is a whole family of entities that have the same properties and
can be collectively described with a single comment, a list of
entity names can be provided.
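
Purely as an illustration of the shape such comments might take, here is a
sketch for a pair of pool functions. The type, the function names and the
pool variable are hypothetical, and the C<=for> lines carry no flags because
that part of the format is still undecided.

```c
/* Hypothetical sketch: the type, names and pool are invented, and the
 * =for lines omit flags since that part of the format is not settled. */

#include <stddef.h>

typedef struct pmc_int_foo {
    struct pmc_int_foo *pmc_int_foo_next;
    long                pmc_int_foo_value;
} PMC_int_foo;

/* file-local head of the list of free Integer Foos */
static PMC_int_foo *int_foo_pool = NULL;

/*=for api pmc perlpmc_del_int_foo
Return an Integer Foo to its allocation pool.
*/
void
perlpmc_del_int_foo(PMC_int_foo *foo)
{
    foo->pmc_int_foo_next = int_foo_pool;
    int_foo_pool = foo;
}

/*=for api pmc perlpmc_new_int_foo
Take an Integer Foo from the allocation pool. Returns NULL if the pool
is empty.
*/
PMC_int_foo *
perlpmc_new_int_foo(void)
{
    PMC_int_foo *foo = int_foo_pool;

    if (foo)
        int_foo_pool = foo->pmc_int_foo_next;
    return foo;
}
```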

| TBC ...

| Should top-of-file and section comments also be structured?
| I can't think of any good reason why.


=item Optimisations

Whenever code has deliberately been written in an odd way for performance
reasons, you should point this out - if nothing else, to avoid some
poor shmuck trying subsequently to replace it with something 'cleaner'.

    /* The loop is partially unrolled here as it makes it a lot faster.
     * See the .dev file for the full details
     */

=item General comments

While there is no need to go mad commenting every line of code, it
is immensely helpful to provide a "running commentary" every 10 or so
lines say; if nothing else, this makes it easy to quickly locate a
specific chunk of code. Such comments are particularly useful at the
top of each major branch, eg

    if (FOO_bar_BAZ(**p+*q) <= (r-s[FOZ & FAZ_MASK]) || FLOP_2(z99)) {
	/* we're in foo mode: clean up lexicals */
	... (20 lines of gibberish) ...
    }
    else if (...) {
	/* we're in bar mode: clean up globals */
	... (20 more lines of gibberish) ...
    }
    else {
	/* we're in baz mode: self-destruct */
	....
    }

=back

=head2 Extensibility

If Perl 5 is anything to go by, the lifetime of Perl 6 will be at least
seven years. During this period, the source code will undergo many major
changes never envisaged by its original authors - cf threads, unicode
in perl 5. To this end, your code should make as few assumptions as
possible. For example, if your struct eventually needs more than
32 flags, can it be gracefully expanded to more than a single word of
flags? Bear in mind that there may be code in other people's Perl
extensions and code that Perl itself is embedded in, all of which
may be using your stuff. Or there may be other distributions of Perl
using your code. You may find it rather difficult to persuade all these
other programmers to modify their code due to your lack of foresight.
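
One possible way to keep a flag set expandable, sketched under the
assumption that flags are identified by number and only ever touched through
accessor functions. The C<PMC_flagset> type and the C<perlpmc_flag_*>
functions are invented for this example, and no bounds checking is done:
callers are assumed to pass a flag number below
C<PMC_FLAGSET_WORDS * PMC_FLAGSET_BITS>.

```c
#include <limits.h>

/* Hypothetical sketch: all names here are invented for illustration. */
#define PMC_FLAGSET_WORDS   2   /* raise this to make room for more flags */
#define PMC_FLAGSET_BITS    (sizeof(unsigned long) * CHAR_BIT)

typedef struct pmc_flagset {
    unsigned long pmc_flagset_words[PMC_FLAGSET_WORDS];
} PMC_flagset;

/* Flags are numbers, not bit patterns, so callers neither know nor
 * care which word a given flag lives in. */
void
perlpmc_flag_set(PMC_flagset *set, unsigned int flag)
{
    set->pmc_flagset_words[flag / PMC_FLAGSET_BITS]
        |= 1UL << (flag % PMC_FLAGSET_BITS);
}

void
perlpmc_flag_clear(PMC_flagset *set, unsigned int flag)
{
    set->pmc_flagset_words[flag / PMC_FLAGSET_BITS]
        &= ~(1UL << (flag % PMC_FLAGSET_BITS));
}

int
perlpmc_flag_test(const PMC_flagset *set, unsigned int flag)
{
    return (int)((set->pmc_flagset_words[flag / PMC_FLAGSET_BITS]
                  >> (flag % PMC_FLAGSET_BITS)) & 1UL);
}
```

Because no caller ever writes C<flags & 0x10> itself, growing past a single
word means changing one constant and recompiling, rather than hunting down
every user of the struct.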

| This needs expanding.
| Can anyone supply some specific dos and don'ts ?

=head2 Portability

Related to extensibility is portability. Perl runs on many, many platforms,
and will no doubt be ported to ever more bizarre and obscure ones over time.
You should never assume an operating system, processor architecture,
endian-ness, word size, or whatever. In particular, don't fall into
any of the following common traps:

| TBC ... Any suggestions welcome !!!
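
While that list is still to be written, here is one small illustration of
the general advice about endian-ness and word size. It is a sketch only, and
C<perlmisc_read_le32()> is a hypothetical name: the function decodes a 32-bit
little-endian value byte by byte instead of casting the buffer to an integer
pointer, so it behaves the same whatever the host's byte order, alignment
rules or size of C<long>.

```c
/* Hypothetical helper: decode a 32-bit little-endian value from a byte
 * buffer without assuming the host's byte order, alignment or word size. */
unsigned long
perlmisc_read_le32(const unsigned char *bytes)
{
    return  (unsigned long)bytes[0]
         | ((unsigned long)bytes[1] << 8)
         | ((unsigned long)bytes[2] << 16)
         | ((unsigned long)bytes[3] << 24);
}
```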


=head2 Performance

We want Perl to be fast. Very fast. But we also want it to be portable
and extensible. Based on the 90/10 principle, (or 80/20, or 95/5,
depending on who you speak to), most performance is gained or lost in
a few small but critical areas of code. Concentrate your optimisation
efforts there.

Note that the most overwhelmingly important factor in performance is
in choosing the correct algorithms and data structures in the first
place. Any subsequent tweaking of code is secondary to this. Also, any
tweaking that is done should as far as possible be platform independent,
or at least likely to cause speed-ups in a wide variety of environments,
and do no harm elsewhere. Only in exceptional circumstances should
assembly ever even be considered, and then only if generic fallback code
is made available that can still be used by all other non-optimised
platforms.
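
The "generic fallback" requirement might be met along these lines. Note
that this sketch swaps in a compiler builtin rather than assembly, since the
structure is the same: C<__builtin_popcountl> is a GCC builtin, while
C<HAS_BUILTIN_POPCOUNT> and the C<perlmisc_> function names are assumptions
standing in for whatever auto-configuration would really provide.

```c
/* Hypothetical sketch: HAS_BUILTIN_POPCOUNT stands in for a value that
 * an auto-configuration step would define; it is not a real Perl macro. */
#if defined(__GNUC__) && !defined(HAS_BUILTIN_POPCOUNT)
#  define HAS_BUILTIN_POPCOUNT
#endif

/* Generic version: always compiled, so it can be tested everywhere. */
unsigned int
perlmisc_count_bits_generic(unsigned long word)
{
    unsigned int count = 0;

    while (word) {
        word &= word - 1;       /* clear the lowest set bit */
        count++;
    }
    return count;
}

/* Callers use this one; the fast path is chosen at compile time. */
unsigned int
perlmisc_count_bits(unsigned long word)
{
#ifdef HAS_BUILTIN_POPCOUNT
    return (unsigned int)__builtin_popcountl(word);
#else
    return perlmisc_count_bits_generic(word);
#endif
}
```

Keeping the generic version compiled on every platform means the optimised
path can always be checked against it.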

Probably the dominant factor (circa 2001) that affects processor
performance is the cache. Processor clock rates have increased far
in excess of main memory access rates, and the only way for the
processor to proceed without stalling is for most of the data items
it needs to be found to hand in the cache. It is reckoned that even a
2% cache miss rate can cause a slowdown in the region of 50%. It is for this
reason that algorithms and data structures must be designed to be
'cache-friendly'.

A typical cache may have a block size of anywhere between 4 and 256
bytes.  When a program attempts to read a word from memory and the word
is already in the cache, then processing continues unaffected.
Otherwise, the processor is typically stalled while a whole contiguous
chunk of main memory is read in and stored in a cache block. Thus,
after incurring the initial time penalty, you then get all the memory
adjacent to the initially read data item for free.  Algorithms that make
use of this fact can experience quite dramatic speedups.  For example,
the following pathological code ran four times faster on my machine by
simply swapping C<i> and C<j>, so that the subscript reads C<a[i][j]>.

    int a[1000][1000];
    
    ... (a gets populated) ...
    
    int i, j, k = 0;
    for (i=0; i<1000; i++) {
	for (j=0; j<1000; j++) {
	    k += a[j][i];
	}
    }

This all boils down to: keep things near to each other that get accessed
at around the same time. (This is why the important optimisations
occur in data structure and algorithm design rather than in the detail of
the code.)
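
For anyone wanting to repeat the experiment, here is a minimal sketch of a
timing harness. The function names and the use of C<clock()> are choices
made for this example, and the size of any speedup will depend entirely on
the machine and compiler; nothing here is meant to reproduce the factor of
four quoted above.

```c
#include <time.h>

#define GRID_SIZE 1000

static int grid[GRID_SIZE][GRID_SIZE];  /* static: too big for most stacks */

void
fill_grid(void)
{
    int i, j;

    for (i = 0; i < GRID_SIZE; i++)
        for (j = 0; j < GRID_SIZE; j++)
            grid[i][j] = (i + j) & 7;
}

/* Inner loop walks along a row: consecutive addresses. */
long
sum_row_order(void)
{
    long total = 0;
    int i, j;

    for (i = 0; i < GRID_SIZE; i++)
        for (j = 0; j < GRID_SIZE; j++)
            total += grid[i][j];
    return total;
}

/* Inner loop walks down a column: a stride of GRID_SIZE ints. */
long
sum_column_order(void)
{
    long total = 0;
    int i, j;

    for (i = 0; i < GRID_SIZE; i++)
        for (j = 0; j < GRID_SIZE; j++)
            total += grid[j][i];
    return total;
}

/* CPU seconds taken by one call of fn; the sum is stored in *result. */
double
time_sum(long (*fn)(void), long *result)
{
    clock_t start = clock();

    *result = fn();
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Call C<fill_grid()> once, then pass C<sum_row_order> and C<sum_column_order>
to C<time_sum()> and compare the two times. Both orders must produce the
same sum; only the memory access pattern differs.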

If you do put an optimisation in, time it on as many architectures
as you can, and reject it if it slows down on any of them! And remember
to document it.

| Any other generic suggestions for optimising ??

=head1 REFERENCES


The section on coding style is based on Perl5's F<Porting/patching.pod>
by Daniel Grisinger. The section on naming conventions grew from some
suggestions by Paolo Molaro <lupus@lettere.unipd.it>. The rest of it
is probably my fault.


