I will try to make a list of items here, but I could probably write a book on this.
(Useful pre-reading: About these papers)
- Assert everything.
Assertions take care of white-box testing your code, so that automated software testing can be confined to the realm of strictly black-box testing, as it should. Assertions do not execute on release builds / production runs, so they essentially cost nothing. This means that you can go wild with them:
- Go ahead and assert that your array is sorted before performing binary search on it.
- Verify that your binary search worked correctly by comparing its result against the result of a linear search for the same item (see the sketch after this list).
Yes, the time complexity of these assertions is far greater than the time complexity of the operations that they guard, and this is perfectly fine, because:
- Remember, assertions do not execute on release runs, so they cost nothing.
- On test runs, you are not supposed to be using large amounts of data anyway. When N is small, then O(N) and even O(N²) are not very different from O(log₂(N)), which means that even when assertions do execute, they do not matter.
- To the small extent that assertions might nonetheless slow you down during development, you can see it as one more reason why you, as a developer, should have a computer which is much more powerful than the computers of mere mortals --er, I mean, users.
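Here is a minimal sketch, in Java, of the two assertions described above; the helper names are illustrative. Note that Java assertions only execute when the JVM is started with `-ea`, which is exactly the "test runs only" behavior described above; the linear search lives inside the assertion, so it costs nothing on release runs.
```java
import java.util.List;

final class Searching
{
    static <T extends Comparable<T>> int binarySearch( List<T> items, T key )
    {
        assert isSorted( items );                            // O(N) precondition check; executes on test runs only
        int result = doBinarySearch( items, key );           // O(log N); always executes
        assert agreesWithLinearSearch( items, key, result ); // O(N) postcondition check; executes on test runs only
        return result;
    }

    private static <T extends Comparable<T>> boolean isSorted( List<T> items )
    {
        for( int i = 1; i < items.size(); i++ )
            if( items.get( i - 1 ).compareTo( items.get( i ) ) > 0 )
                return false;
        return true;
    }

    private static <T extends Comparable<T>> boolean agreesWithLinearSearch( List<T> items, T key, int result )
    {
        int linearResult = linearSearch( items, key );
        if( (result == -1) != (linearResult == -1) )         // both searches must agree on whether the key exists
            return false;
        return result == -1 || items.get( result ).compareTo( key ) == 0; // compare elements, not indexes, to tolerate duplicates
    }

    private static <T extends Comparable<T>> int linearSearch( List<T> items, T key )
    {
        for( int i = 0; i < items.size(); i++ )
            if( items.get( i ).compareTo( key ) == 0 )
                return i;
        return -1;
    }

    private static <T extends Comparable<T>> int doBinarySearch( List<T> items, T key )
    {
        int low = 0, high = items.size() - 1;
        while( low <= high )
        {
            int mid = (low + high) >>> 1;
            int comparison = items.get( mid ).compareTo( key );
            if( comparison < 0 )
                low = mid + 1;
            else if( comparison > 0 )
                high = mid - 1;
            else
                return mid;
        }
        return -1;
    }
}
```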
When I look at code, I don't ask myself "should I assert that?" Instead, I ask myself "is there anything that I forgot to assert?" The idea is to assert everything that could possibly be asserted, leave nothing assertable unasserted. I call this The Maximalistic Approach to Error Checking, in contrast to the inadvertently predominant minimalistic approach, where programmers decide on a case by case basis whether to assert something or not, based on completely–oblivious–of–Murphy's–law assumptions about how likely it is to go wrong, inappropriately mixed with misguided performance considerations.
For more information, see michael.gr - Assertions and Testing.
Also note that the attention span horizon of code is the function, so if function f1() asserts some condition and then invokes function f2(), it is perfectly fine for f2() to also assert the same condition. In other words, whether something has already been asserted or not by some other function is irrelevant: each function must assert every condition that pertains to it.
- Do black-box testing, avoid white-box testing. Heed the advice that says test against the interface, not the implementation. Unit Testing is testing against the implementation, so despite the entire software industry's addiction to it, it should be avoided. Incidentally, this means that mocking, despite being an admirably nifty trick, should never be used: if you are using mocks then you are doing white-box testing, so you are doing it wrong.
- For more on why Unit Testing is white-box testing, and why white-box testing is bad, read this: michael.gr - White Box vs. Black-Box Testing
- For more on why mocks in particular are especially bad, read this: michael.gr - If you are using mock objects you are doing it wrong.
- For what to use instead of mocks, read this: michael.gr - Software Testing with Fakes instead of Mocks
- For what to do instead of unit testing, read this: michael.gr - Incremental Integration Testing
- If for some reason you must do white-box testing, then you can at least avoid having to do it in code; read this: michael.gr - Audit Testing and this: michael.gr - Collaboration Monitoring
- Avoid non-determinism in tests. Testing must be completely free from non-determinism under all circumstances. Since testing code exercises production code, this means that production code must also be free from non-determinism, or at the very least any source of non-determinism in production code must be replaceable during testing with a fake which is completely deterministic. For example:
- Never rely on the garbage-collector doing anything other than reclaiming memory; specifically, never rely on any cleanup operations being initiated by the garbage-collector. Perform all cleanup explicitly. For more information, see michael.gr - Object Lifetime Awareness.
- Never allow any external factors such as file creation times, IP addresses resolved from DNS, etc. to enter into the tests. Fake your file-system; fake The Internet if necessary.
- Never use wall-clock time; always fake the clock, making it start from some arbitrary fixed origin and incrementing by a fixed amount each time it is queried (see the sketch after this list).
- Never use random numbers; if randomness is necessary in some scenario, then fake it using a pseudo-random number generator seeded with a known fixed value. This includes all constructs that use randomness, for example GUIDs/UUIDs.
- Never allow any concurrency during testing; all components must be tested while running strictly single-threaded, or at the very least multi-threaded but in lock-step fashion.
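As a sketch of the fake clock mentioned above (all names are hypothetical; this is a hand-rolled interface, not `java.time.Clock`): production code consults the interface, the production implementation consults the wall clock, and the test implementation starts at a fixed origin and advances by a fixed step on every query.
```java
import java.time.Duration;
import java.time.Instant;

interface TimeSource
{
    Instant now();
}

// Production implementation: consults the wall clock.
final class RealTimeSource implements TimeSource
{
    @Override public Instant now() { return Instant.now(); }
}

// Test implementation: starts at a fixed origin and advances by a fixed step on every query,
// so every test run observes exactly the same sequence of timestamps.
final class FakeTimeSource implements TimeSource
{
    private Instant current = Instant.parse( "2020-01-01T00:00:00Z" );

    @Override public Instant now()
    {
        Instant result = current;
        current = current.plus( Duration.ofMillis( 100 ) );
        return result;
    }
}
```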
- Minimize state, maximize immutability. Design so that as much code as possible is dealing with data that is immutable. Re-examine every single class which contains mutable members, and chances are you will find that it could be replaced with an immutable class. Even if not, you might discover that many of its members could be immutable.
Eschew frameworks, technologies, and techniques that prevent or hinder immutability. For example, if you are using some dependency-injection (DI) facility that provides you with auto-wiring, use constructor injection only, so that you can always store in final/readonly members. If your DI facility does not support constructor injection, throw away everything and start from scratch with one that does.
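A minimal sketch of what constructor-only injection enables (all names are hypothetical): every dependency arrives through the constructor, so every field can be final and the class carries no mutable state.
```java
interface OrderRepository { void save( String order ); }
interface PaymentGateway { void charge( String customer, double amount ); }

final class OrderService
{
    private final OrderRepository orderRepository;  // final: assigned exactly once, in the constructor
    private final PaymentGateway paymentGateway;

    OrderService( OrderRepository orderRepository, PaymentGateway paymentGateway ) // constructor injection only
    {
        this.orderRepository = orderRepository;
        this.paymentGateway = paymentGateway;
    }

    void placeOrder( String customer, String order, double amount )
    {
        paymentGateway.charge( customer, amount );
        orderRepository.save( order );
    }
}
```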
Note, however, that immutability is not important in function-local variables. There is absolutely nothing wrong with function-local mutation if it serves the slightest purpose. Which brings us to the next point:
- Do overwrite function parameters. There exists a widespread cargo cult habit among programmers, of never overwriting the value of a parameter to a function within the function. This habit is so unquestioned that it enjoys "best practice" status, despite being completely misguided. Some languages (e.g. Scala) even prohibit it, which is deplorable. Go ahead and overwrite function parameters (if your language allows it) when the original parameter value should not be used in the remainder of the function. In doing so you are minimizing the number of variables that are in scope, and preventing accidental use of the original value.
The historical origins of the practice of never overwriting function parameters are actually quite funny: some early versions of Fortran (the first programming language) used to pass everything by reference, including constants. So, if you had function F(X) which was invoked with 3 for X, and within F(X) you assigned 5 to X, then from that moment on the constant 3 would actually have the value 5 in your entire program. As a result, early computer scientists decreed that function parameters should never be reassigned. Fortran was soon fixed to correct this problem, but the advice kept being passed from generation to generation of programmers, who have been accepting it without rethinking it. This is cargo cult programming at its finest.
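A minimal sketch (the normalization logic is hypothetical) of overwriting a parameter as soon as the original value should no longer be used:
```java
final class Greetings
{
    static String greeting( String name )
    {
        name = name.trim();   // from here on, only the trimmed value exists; the raw value cannot be used by accident
        if( name.isEmpty() )
            name = "stranger";
        return "Hello, " + name + "!";
    }
}
```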
- Avoid Pernicious Local Variable Initialization. Contrary to what many people falsely think of as "best practice" and "common knowledge", you should never initialize any variable before you have a meaningful value to assign to it. For more information, see michael.gr - Pernicious Local Variable Initialization.
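A minimal sketch of why the habit is pernicious, assuming a hypothetical fee calculation: leave the variable unassigned and the compiler's definite-assignment analysis will catch a missing branch; initialize it with a meaningless zero and the bug compiles silently.
```java
final class Fees
{
    static int admissionFee( boolean isMember, boolean isStudent )
    {
        int fee;              // deliberately NOT initialized with a meaningless zero
        if( isMember )
            fee = 0;
        else if( isStudent )
            fee = 5;
        else
            fee = 10;         // omit this `else` and the compiler reports "fee might not have been initialized";
                              // with `int fee = 0;` the omission would compile silently and yield a wrong result.
        return fee;
    }
}
```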
- Avoid b-to-a style conversions, use a-from-b style instead. When I see `A = AfromB( B )` I can immediately tell that it looks correct, since A is on the side of A and B is on the side of B. However, when I see `B = AtoB( A )` I have to stare at it for a little while longer before I can tell whether it is correct or not. Of course, this is a trivial example: in real-world situations, the identifiers, as well as the call chain, could be much longer and much more complicated. This is related to Joel Spolsky's notion that wrong code should look wrong, and it is especially important since the entire industry has traditionally been doing it in precisely the wrong way with B-to-A style conversions.
- Avoid Yoda conditionals. This is the practice of reversing the terms around the equality operator when one of the terms is a constant. You might have seen it in the following forms:
- `if( 5 == variable )` instead of the normal `if ( variable == 5 )`.
- `if( "text".equals( variable ) )` instead of the normal `if( variable.equals( "text" ) )`.
Don't do this. The Principle of Least Surprise is not just violated by this construct, it is gang-raped. Plus, in doing this you are most probably engaging in the cardinal sin of silent failure. Here are the reasons often cited for using Yoda conditionals, and their rebuttals:
- Alleged reason #1
- Statement: It will catch accidental use of the assignment operator where the equality operator was intended.
- Rebuttal: Such accidental use should be impossible because your compiler or your IDE should be issuing a warning if you try to do this. If you are not receiving a warning, then you have other, much bigger problems in need of solving, e.g. using the wrong programming language, using the wrong IDE, or trying to write code without first having figured out how to enable all warnings.
- Alleged reason #2
- Statement: It works even if the variable accidentally happens to be null.
- Rebuttal: No, it does not work; it silently fails. If you follow offensive programming, the definition of "it works" is that it produces correct results when given valid input, and it decisively fails when given invalid input.
So, there are two possibilities: either the variable may legitimately be null, or it may not.
- if the variable may legitimately be null, then make it evident by explicitly checking against null.
- if the variable may not legitimately be null, then write the code so that it will not fail to fail if the variable ever turns out to be null.
- Avoid unnecessary braces. Doing so keeps the code more compact, making more statements fit within the screen. The cargo-cult programming convention of enclosing even single-statement blocks within curly braces allegedly avoids bugs caused by trying to add a second statement to the block while forgetting to introduce curly braces. This has actually happened to me once, and the programmer who introduced the bug in my code did not even apologize, because he considered it my fault for not having provided the curly braces for him to insert his second statement in. The fact of the matter is that a decent IDE will point out such a mistake as a formatting violation, so this is not a problem today. Of course, in order to enable the IDE to point out formatting violations you must be keeping a consistent indentation style everywhere, right? Right?
- Avoid Egyptian-style curly braces. People who use Egyptian-style curly braces essentially treat them as noise. I would very much favor a programming language where nesting is based on indentation alone, thus requiring no curly braces; unfortunately, the only such language that I know is Python, which is a scripting language, and therefore out of the question; so, for as long as we are using programming languages that require curly braces, we have to pay attention to them and we cannot just treat them as noise; therefore, absolutely all curly braces must absolutely always be perfectly aligned; period, end of story, discussion is locked and comments are closed.
- Minimize flow control statements, especially the `if` statement. If there is any opportunity to structure a piece of code so as to eliminate an `if` statement, the opportunity should be pursued tenaciously. Of course, by this I do not mean replacing `if` statements with the conditional operator ( a ? x : y ); the conditional operator is nice, because it makes code more expressive and compact, but it is equivalent to an `if` statement, so it too should be eliminated when possible. The `if` statement can be avoided in many cases with the use of calculations, with lookup tables, with the judicious application of inheritance and polymorphism, etc.
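A minimal sketch (hypothetical discount rules) of replacing an `if`/`else if` chain with a lookup table; judicious use of polymorphism is the other common route.
```java
import java.util.Map;

enum CustomerKind { REGULAR, SILVER, GOLD }

final class Discounts
{
    private static final Map<CustomerKind, Double> discountByCustomerKind = Map.of(
            CustomerKind.REGULAR, 0.00,
            CustomerKind.SILVER, 0.05,
            CustomerKind.GOLD, 0.10 );

    static double discountOf( CustomerKind customerKind )
    {
        Double discount = discountByCustomerKind.get( customerKind );
        assert discount != null : customerKind;  // offensive programming: an unmapped kind fails loudly
        return discount;
    }
}
```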
- Favor one and only one way of doing any given thing. If you ask a hundred programmers to write some code that accomplishes a certain simple task, you will get a hundred different solutions. These solutions will reflect different ways of thinking, which is inevitable, but they will also reflect different coding conventions, which is entirely unnecessary. Establish conventions that minimize unnecessary differences. One easy way to achieve this is to stipulate that any construct which is optional must be omitted. For example:
- Disallow extra parentheses in expressions.
Unfortunately, compilers by default allow superfluous parentheses without complaining. This has fostered the development of some truly bizarre habits among programmers, such as the construct `return (x);` which is so common that some folks are under the impression that this is the correct syntax, and that `return x;` would constitute a syntax error. Well, guess what: `return x;` is the correct syntax, whereas `return (x);` contains a pair of superfluous parentheses. Configure your compiler or your code analysis tool-set to disallow unnecessary parentheses, so that all code that accomplishes the same thing looks the same.
If you do this, then the tooling will also complain about parentheses that you might be using elsewhere to clarify the order in which calculations are to be performed when you are unsure about the operator precedence rules of the language. There are two things you can do about that:
- Your programming language has a very specific, very well documented, and rather small set of rules that govern operator precedence; these rules are fundamental, and this programming language is your bread and butter; so, learn them. Learn them all by heart, so that you are never unsure about operator precedence, so that you never need extra parentheses for clarification, so that you can disallow extra parentheses everywhere.
- Alternatively, if you do not want to learn the operator precedence rules by heart, then extract sub-expressions into separate (preferably locally nested) functions, so that again, you never need extra parentheses.
- Disallow optional keywords.
In some languages, certain keywords are implied by default and can be omitted. Unfortunately, in virtually all example code out there, such keywords tend to always be included, which leads people to form the impression that they must be mandatory.
For example, did you know that in C# every class is `internal` by default? This means that you never have to say `internal class Foo { ... }`, you can simply say `class Foo { ... }`.
Furthermore, did you know that in C# every class member is `private` by default? This means that you never have to say `private int foo() { ... }`, you can simply say `int foo() { ... }`.
Again, it is fundamental rules of the language that govern these things, which means that every programmer should know them by heart, which in turn means that nobody should be surprised to see `int foo() { ... }`, and nobody should be wondering what the visibility of `foo()` would be.
- Disallow the `var` keyword.
If we were to mandate that two lines of code should look identical if they accomplish the same thing, we have two options: either always require the `var` keyword, or completely disallow it.
Always requiring the `var` keyword is not an option, because in many cases the type cannot be inferred from the right hand side, so it must be specified. Thus, we are only left with the option of completely disallowing it, and that is the way to go.
Furthermore, as I explain elsewhere, "absolutely any choice that makes code easier to read is absolutely always preferable over absolutely any choice that makes code easier to write", and the `var` keyword is a prime example of a choice which is easy to write but makes code harder to read, so we should not even be debating this.
If you are not sure about the exact type of the right-hand side of an assignment, or if you do not want to be bothered with having to type it, it is perfectly okay to begin with `var x = ...`, but once you have written your entire statement you must always go back to the `var` keyword and ask your IDE to refactor it and replace it with the actual type.
The `var` keyword is only useful in type casts; I would rather say `var x = (int)y;` than `int x = (int)y;`. However, the benefits of being able to disallow `var` with a rule outweigh the convenience of being able to use it in type casts.
- Put the complexity in the design, not in the code. If the code does not look so simple that even an idiot can understand it, then the code is too complex. When this happens, it usually means that shortcuts were taken in the design, which had to be compensated for with overly complex code. Make the design as elaborate as necessary so that the code can be as simple as possible. Overly complex code is usually the result of violations of the Single Responsibility Principle. Which brings us to the next point:
- Adhere to the Single Responsibility Principle like your life depends on it. Often, what you think of as a single responsibility can in fact be further sub-divided into a number of more fundamental responsibilities.
- Almost all of the code that we write performs, or can be thought of as performing, some kind of transformation, involving a certain number of participants. For example:
- At the lowest level, an assignment operation transforms each bit of the destination variable into the corresponding bit of the source variable. Obviously it involves two participants: the source and the destination.
- At the highest level, a shopping web site transforms relational data and user input into pixels on the user's browser window and purchase orders in the logistics department. In this simplified view we have four participants, realistically we have many more.
- Most transformations are of the simplest kind, involving only two participants, transforming one into the other. That's great, that's a single responsibility: convert A to B.
- Many transformations involve three participants, A, B and C, and they tend to be appreciably complex.
- In some cases they can be simplified into successive operations, one to go from A to B and another to go from B to C, meaning that there were in fact two different responsibilities which were identified and realized as separate steps.
- However, quite often they cannot be simplified, as for example when we are converting A to C by consulting B. That's a single responsibility which cannot be further broken down.
- All too often, people manage to involve four or more participants in a single transformation. These tend to be grotesquely complex, and they invariably constitute violations of the single responsibility principle. It goes without saying that they must be avoided at all costs.
Luckily, operations that involve more than 3 participants can always be refactored into multiple successive transformations of no more than 3 participants each, by introducing intermediate participant types if necessary. (I have never heard of this being suggested by anyone before, so this could perhaps be The Mike Nakis Postulate for Simplification.)
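A minimal sketch of the postulate, with entirely hypothetical types: instead of one transformation that turns an Order into an Invoice by consulting both a PriceList and a TaxTable (four participants), introduce an intermediate PricedOrder, yielding two transformations of three participants each.
```java
record Order( String item, int quantity ) { }
record PricedOrder( String item, int quantity, double netAmount ) { }
record Invoice( String item, double totalAmount ) { }
interface PriceList { double priceOf( String item ); }
interface TaxTable { double taxRateOf( String item ); }

final class Invoicing
{
    // Order + PriceList -> PricedOrder (three participants)
    static PricedOrder pricedOrder( Order order, PriceList priceList )
    {
        return new PricedOrder( order.item(), order.quantity(), order.quantity() * priceList.priceOf( order.item() ) );
    }

    // PricedOrder + TaxTable -> Invoice (three participants)
    static Invoice invoice( PricedOrder pricedOrder, TaxTable taxTable )
    {
        return new Invoice( pricedOrder.item(), pricedOrder.netAmount() * (1 + taxTable.taxRateOf( pricedOrder.item() )) );
    }
}
```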
- Refactor at the slightest indication that refactoring is due; do not allow technical debt to accumulate. Avoid the situation of being too busy mopping the floor to turn off the faucet. Allow a percentage of sprints to explicitly handle nothing but technical debt elimination. Do not try to spread the task of refactoring over feature development sprints, because:
- The refactoring effort will not magically disappear.
- Focus will be diluted.
- Time estimations will suffer.
- Strive for abstraction and generalization.
The urge to abstract and generalize is often mistaken as having reusability as its sole aim, so it is often met with the YAGNI objection: "You Ain't Gonna Need It". The objection is useful to keep in mind so as to avoid over-engineering, but it should not be followed blindly, because abstraction and generalization have important inherent benefits, regardless of the promise of reusability.
Every problem of a certain complexity and above, no matter how application-specific it might seem to be, can benefit from being divided into a specialized, application-specific part, and an abstract, general-purpose part. Strive to look for such divisions and realize them in the design.
- The application-specific part will be simpler to write and to understand, because it will be free from the incidental complexity represented by the general-purpose part.
- The general-purpose part will be simpler to write and to understand, because it will be implementing an independently knowable, self-contained abstraction.
- Also, the general-purpose part will be fully testable on its own, so you will have assurances that it works, regardless of how the application-specific part uses it, and regardless of how the application-specific part evolves over time.
Note that the above benefits come in addition to the potential benefit of reusability.
In other words, if you can choose between the following two:
- adding 5 lines of application code, vs.
- adding only 2 lines of application code but a whole 10 lines of infrastructure code,
then you should seriously consider going for the latter.
- Use domain-specific interfaces. Encapsulate third-party libraries behind interfaces of your own devising, tailored to your specific application domain. Strive to make it so that any third-party library can be swapped with another product without you having to rewrite application logic. Conventional wisdom says the opposite: we have all heard arguments like:
- "the best code is the code you don't write" (makes me want to invest in the business of not writing software)
- "a third-party library will be better documented than your stuff" (presumably because documentation is a skill your developers have not mastered)
- "if you run into trouble with a library, you can ask for help on Stack Overflow, whereas with something you have developed in-house, you are stuck" (presumably because your developers know nothing of it, despite working with it every day.)
The truth about application development is that the more you isolate the application logic from peripheral technologies, the more resilient your application logic becomes to the ever-changing technological landscape, a considerable part of which is nothing but ephemeral fashions, the use of which is dictated not by actual technological merit, but by C.V. Driven Development instead.
(See https://martinjeeblog.com/2015/03/11/cv-driven-development-cdd/)
Incidentally, this also means one more thing:
- Favor libraries over frameworks. The difference between a framework and a library is that, simply speaking, a library is something that your code invokes, whereas a framework is something that invokes your code. The problem with frameworks is that it is nearly impossible to abstract them away behind custom interfaces; therefore, any code you write using a particular framework will forever be a prisoner of that framework: it will be extremely difficult to replace that framework with a different one without rewriting all your code.
- Strive for what is simple, not for what looks easy. The simple often coincides with the easy, but sometimes the two are at odds with each other. Eschew languages and frameworks that provide the illusion of easiness at the expense of simplicity. The fact that a particular toolset makes "hello, world!" an easy one-liner probably means that the hundred-thousand-liner that you are actually aiming for will be unnecessarily complicated and hard to write.
Watch this: https://www.infoq.com/presentations/Simple-Made-Easy
- Avoid binding by name like the plague. Avoid as much as possible mechanisms whose modus operandi is binding by name: use them only for interfacing with external entities, never for communication between your own modules. REST enthusiasts can cry me a river.
- Binding by name must be avoided even in comments. If you need to refer to an identifier from within a comment, use whatever special notation is offered by the language at hand ({@link ...} in Java, <see cref="..."> in C#) so that when you refactor the name of that identifier later, the IDE will also update any comments that contain references to that identifier.
- Always use strong (static) typing. Avoid any kind of weak typing (euphemistically called dynamic typing) and avoid languages and frameworks that require it or even just sympathize with it. Yes, this includes all scripting languages. Scripting language enthusiasts can cry me a river. (And yes, this includes TypeScript too, simply because it sympathizes with JavaScript.)
Read this: michael.gr - On Scripting Languages.
- Strive for debuggability. For example, do not overdo it with the so-called "fluent" style of invocations, because they are not particularly debuggable. Do not hurry to adopt this or that cool new programming language before you have made sure that debugger support for it is complete and working properly.
- Resist the "idiomatic" craze. Contrary to popular belief, doing things in whatever way is considered idiomatic for the programming language at hand is never an end in and of itself. Avoid the use of idiomatic ways of doing things unless you are convinced that they are superior; some of them are not.
- Strive for testability. Design interfaces that expose all functionality that makes sense to expose, not only functionality that is known to be needed by the code that will invoke them. For example, the application may only need an interface to expose a `register()` and `unregister()` pair of methods, but `isRegistered()` also makes sense to expose, and it will incidentally facilitate black-box testing.
- Enable all warnings that can possibly be enabled. The fact that a certain warning may on occasion be issued on legitimate code, is no reason to disable the warning: the warning must be enabled, and each occurrence of the warning must be dealt with on a case-by-case basis.
- The best way to deal with a warning is to resolve it. For example:
- If your compiler is warning you that a certain cast is redundant, remove that redundant cast. (Duh!)
- If the compiler is warning you that you are dereferencing a pointer which might be null at that point, then add a null check before dereferencing it. (Duh!)
- If your compiler is warning you that you are invoking an overridable method from within the constructor of a base class, then do as much restructuring as necessary so that no such thing is happening.
- Another way of dealing with warnings is by suppressing them. Of course, this approach should only be used on perfectly legitimate code that would become less perfect if it was to be restructured so as to resolve the warning. Suppression should always be as localized as possible, meaning that it should be done on the individual statement where the warning is issued, instead of the entire function or the entire class. Note, however, that there are certain warnings that should always be properly resolved and never suppressed; take the invocation of an overridable method from within the constructor of a base class for example.
- Some warnings, like "unused identifier", occur on legitimate code too often for selective suppression to be practical. For those warnings, consider using an IDE that supports a "weak warning" or "suggestion" level, which is highlighted inconspicuously, so it can be easily filtered out by your eyes, but the visual clue is still there in case it points to something unexpected. Also consider using a better programming language, which supports a construct known as a "discard variable", allowing the programmer to explicitly state their intention to let a variable go unused.
- And of course some silly warnings occur on legitimate code all the time, so it goes without saying that they need to be disabled, but in my experience they are far fewer than the average programmer thinks they are.
- Thou shalt not suffer a warning to live. Every single warning must always be resolved immediately upon being introduced. Nobody should ever commit code that contains warnings, and therefore nobody should ever check out code that already contains warnings.
This is because a warning always is (or ought to always be) a cause of alarm; however, long-standing warnings constitute long-standing false alarms, so their continued existence causes two things:
- All programmers in the house start becoming insensitive to the alarms, so the alarms start going unnoticed.
- Those programmers who are perfectionists (and those are the best kind of programmers) start becoming mighty annoyed.
Which brings us to the next point:
- Treat Warnings as Errors. Always use the "treat warnings as errors" option of your compiler. If your compiler does not have such an option, throw away everything and start from scratch with a compiler that has it.
The conventional understanding of what the difference is between warnings and errors is that with an error, you have to fix it before you can proceed, whereas with a warning, you can just ignore it and proceed.
This understanding is technically correct, in the sense that this is in fact how compilers tend to behave by default, and this is in turn what most programmers expect, since dumb defaults seem to always suit mindless majorities. However, this conventional understanding, and therefore this default behavior of compilers, is wrong. It has been wrong since the dawn of our discipline, and it continues to be wrong today. The magnitude of the wrongness, multiplied by the pervasiveness of the wrongness, is truly staggering.
The difference between warnings and errors should be that you can suppress a warning if you must, whereas you cannot suppress an error; however, you should absolutely have to address and eliminate both, meaning that you should have to either explicitly suppress or otherwise resolve every single warning before being allowed to proceed.
The "treat warnings as errors" option corrects the wrong behavior of compilers, and exists precisely for the benefit of those (apparently very few) people in our discipline who happen to have their reasoning right on this issue.
Be one of those people. Use that option.
- Strive for readability. Readability is one of the most important qualities of code, second only to correctness. Code is generally read far more often than it is written. Actually, over time, the reads-to-writes ratio for any given piece of code approaches infinity: we tend to read code several times as we write it, at least once more as we review it, and then many more times throughout its lifetime as we extend it, refactor it, or tweak it; as we write nearby code; as we browse through code to understand how things work; as we perform troubleshooting; etc. Therefore:
Absolutely any choice that makes code easier to read is absolutely always preferable over absolutely any choice that makes code easier to write.
This means that languages that achieve great terseness of code are not really delivering anything of value by this alone, (I am looking at you, Scala,) because verbosity of code is not one of the major problems that our profession is faced with; unreadable code is. This also means that certain languages whose grotesquely arcane syntax has earned them the "write-only language" designation are not to be touched with a 10 ft. pole. Perl enthusiasts can cry me a river.
- Avoid using elements of prose in code. Identifiers should be pedantic, not creative, and unless they pertain to the problem domain, they should come from the realm of engineering, not from the realm of literature. Think twice before using a term like "drop" instead of "delete", "payload" instead of "content", etc. because "drop" and "payload" are metaphors. Metaphor should be avoided unless it helps to express something that would otherwise require an entire sentence to express, for example "Factory" instead of "ObjectThatCreatesOtherObjects".
- Use an IDE with a spell checker. Avoid anything that fails to pass the spell check.
Add the spell-checking dictionary of the IDE to source control and review any commits to it just as you review any other code.
This specifically means abandoning certain old habits; all of the following are wrong:
nrPoints; pointsNr; nPoints; pointsN; noPoints; pointsNo; lenPoints; pointsLen
Only the following are right:
numberOfPoints; pointCount; pointsLength
- Avoid acronyms and abbreviations. Use fully spelled-out words of the English language instead.
Acronyms and abbreviations are cryptic for the uninitiated, and even if they are not, they make the code look unnecessarily technical. Modern IDEs have formidable auto-completion features, so fully spelling out every word does not necessarily mean that you will have to type more, but even if it did, typing is not one of the major problems that our profession is faced with; unreadable code is.
This means that a huge number of abbreviations which have traditionally been staple terms in programming, should never be used, or their use should be seriously reconsidered. This includes all of the following: abs, addr, alloc, alt, app, arg, async, attr, auth, avg, bg, bat, bin, bool, buf, buff, btn, calc, cert, char, cls, clr, col, coll, cmd, com, cmp, comp, cfg, conf, config, const, ctx, ctrl, conv, coord, cos, cnt, cur, curr, db, dbg, dec, decl, def, deg, del, desc, dest, dev, diff, dim, dir, disp, div, doc, drv, dyn, env, eq, err, exe, exp, expr, ext, fac, fig, fg, fmt, frac, freq, fn, fun, func, gen, geom, hdr, hex, img, imp, impl, inc, idx, info, init, ins, inst, int, iter, lang, len, lib, lnk, max, mem, msg, mid, min, misc, mod, mul, mut, nav, net, num, obj, org, pkg, param, perf, pic, ptr, pos, pow, pwr, pred, pref, prev, priv, proc, prof, pub, rand, rnd, recv, rec, rect, ref, regex, rel, rem, rm, repo, req, res, ret, rev, sel, seq, svc, sess, sin, sln, src, spec, sqrt, std, stmt, stat, str, sub, sync, tan, tmp, temp, txt, usr, util, var, val, vec, ver, win, wiz.
If a particular acronym is understood by every programmer, then it might be okay to use it in code, but if it is only understood by domain experts, then it is not okay. This is because programmers often work on software for domains on which they are not experts, and even if they do eventually become domain experts, in the beginning they are not, but the beginning is when everything is difficult, so that is precisely the time that you do not want to be adding any extra difficulty to them. This means that very few acronyms are actually okay.
Let me stress this to make sure it is understood: Domain Experts may protest that it is awkward to see a particular term fully spelled out in the code, because the term is so well known, that it appears as an acronym in the entirety of the literature in their field; let them find it awkward, and let them protest. Your code is not part of the literature in their field.
If the choice is made to keep a certain acronym in the code, then the acronym must be turned into a word, meaning that only the first letter may be written in upper-case, while all subsequent letters must always be written in lower-case. For example, if you have decided that you are not going to replace `GUID` with `GloballyUniqueIdentifier`, I am totally with you, but then you must replace it with `Guid`, so that the spell-checker can recognize it as a word and spell-check it. Otherwise, the spell-checker will consider each capital letter individually, and each individual letter passes spell-checking, so anything written in all-capitals essentially circumvents the spell-checker. If "Guid" as a word violates your English-language sensitivities, then please remember that you are writing code, not prose. There is a reason it is called code: it is specifically not prose.
Also beware of abbreviations that do not look like abbreviations. For example, the word "out" can be a word on its own, but more often than not, it is used as an abbreviation of "output". Spell out the full word. As another example, in the methods `ToUpper()` and `ToLower()`, the terms "Upper" and "Lower" have no inherent meaning of their own; the proper terms that these abbreviations stand for are "UpperCase" and "LowerCase". Use the proper terms.
- Pay attention to naming. Every single concept must have the best name that it could possibly have. Not just a good name, but an excellent name. Unfortunately, finding the right name for things is hard. It is not a coincidence that naming things is One of the Two Hard Problems in Computer Science. (https://martinfowler.com/bliki/TwoHardThings.html)
Strive for a variety of names that uniquely and accurately reflect each concept that you are dealing with. A Thesaurus is an indispensable programming tool.
(I once worked in a metrology environment where both the main entity of interest was called a "Measurement", and the main thing that you could do with it was to perform a "Measurement"; that's deplorable.)
If a certain domain-specific term is problematic in code, then do not use that term in code. Completely ignore the domain experts who will protest that the original term is the established term in the field and it is awkward to see it replaced with something else.
(In that same metrology environment, the goal of the software was to measure and report how something differs from its ideal form; the term used in that field for this kind of difference was "error", so the software was full of identifiers called "error" that did not stand for error as we know it in software; that's deplorable.)
Avoid zero-information names; invest the necessary amount of thinking so that each name gives at least some hint as to what it is about to someone who sees it for the first time. A good rule of thumb for deciding whether a name is good is to ask yourself the following question:
Could the same name conceivably also stand for some unrelated entity in my code base?
(A co-worker of mine once created a namespace called "DataInfo"; that's deplorable.)
In special cases, dare to use names that you may have never heard anyone using before. For example, if you need a Factory of Factories, why not call it Industry?
Read Chapter 2: Meaningful Names of the book Clean Code by Robert C. Martin.
Also read this: michael.gr - Confucius on Naming
Any code written by a programmer whose English language skills are poor should be reviewed by a programmer whose English language skills are good.
- When words need to be combined to form an identifier, the combination must follow general English grammar rules, except for English grammar special cases.
Read this: Software Engineering Stack Exchange: Clean Code: long names instead of comments - https://softwareengineering.stackexchange.com/q/409455/41811
In the following discussion when we speak of a noun or an adjective or a verb we actually mean a sequence of various parts of speech that effectively constitute a noun or an adjective or a verb. For example, `reticulated_spline` is a noun (spline), `reticulated_before_dive` is an adjective (reticulated), and `dive_for_moog` is a verb (dive).
- Types:
- Classes: The name of a class must always be a noun; it must never be an adjective or a verb; no exceptions. Also, the name of a class must always be in singular form; no exceptions. If you need to signify plurality, do not use plural! Instead, append a plurality-signifying term which is in turn a singular noun. For example, if you have a class that stands for a group of entities, do not call it 'Entities', call it 'EntityGroup' instead. (Duh!)
- Interfaces: The name of an interface must be either an adjective, (e.g. Comparable,) or a noun, (e.g. Serializer,) no exceptions. Singular form goes without saying.
- Enums: The name of an enum type must always be a noun in singular form, no exceptions. (E.g. WeekDay.Monday instead of WeekDays.Monday.)
- Variables:
- Single-value: The name of a single-value variable must always be a noun in singular form, unless it is of boolean type, in which case it may signify a condition, such as isEmpty.
- Collection: The name of a collection variable must always be a noun in plural form, no exceptions.
- Functions:
- Pure: The name of a function that returns a result without mutating anything must always be a noun unless it returns boolean, in which case it may signify a condition, such as hasChildren(). The name must be in singular form, unless a collection is returned, in which case the name must be in plural form.
- Impure: The name of a function that performs an operation (has side effects) must be a verb, no exceptions. If the impure function returns a result indicating success or failure, the name must begin with 'try' followed by the actual verb, for example 'tryAdd()'. If the name does not begin with `try` then the rule is that the function will signal failure by throwing an exception.
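A minimal sketch (hypothetical registry) of this naming rule: `add()` signals failure by throwing, while `tryAdd()` reports it through its return value.
```java
import java.util.HashSet;
import java.util.Set;

final class Registry
{
    private final Set<String> keys = new HashSet<>();

    // Impure, no 'try' prefix: signals failure by throwing.
    void add( String key )
    {
        if( !keys.add( key ) )
            throw new IllegalStateException( "already registered: " + key );
    }

    // 'try' prefix: reports success or failure through the return value.
    boolean tryAdd( String key )
    {
        return keys.add( key );
    }
}
```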
When multiple words are combined to form an identifier, they must still make sense. As an example of what to avoid, take the `INotifyPropertyChanged` interface of WPF. This name is deplorable because notify is a verb, not a noun or an adjective, and because an object implementing this interface is not a property-changed notification, it is an object which may issue property-changed notifications. Admittedly, it is difficult to come up with a good name to describe such objects; a decent choice might be `PropertyChangedNotificationIssuer`, but this might be a bit too long for some people's taste. An alternative is to use a familiar term of broader scope if there is no possibility of confusion. So, another decent choice here might simply be `Mutable`. It is true that all kinds of different classes are mutable without issuing property-changed notifications, but then again the only thing that different mutable classes could have in common simply by virtue of being mutable, so as to warrant a common interface for all of them, is issuing notifications about their mutations. The point to take home from all this is that although it is difficult to come up with good names, the application of some actual thinking should produce a name which is at least a bit better than nonsense.
As mentioned earlier, special cases of the English grammar can, and should, be ignored. An example of this is the simplification of plurals: choose "indexes" instead of "indices", "schemas" instead of "schemata", and, even though I know this is a tough proposition for some, "companys" instead of "companies". See Software Engineering Stack Exchange: Does it make sense to use "ys" instead of "ies" in identifiers to ease find-and-replace functionality?
Never begin a function name with the prefix 'check'. Doing so is a typical example of a developer choosing names according to fleeting notions in their head, without the slightest concern as to how these names will be understood by others. The word `check` means nothing; a function that only checks something and then does nothing about it would serve absolutely no purpose; presumably, whatever checking the function does culminates in taking some kind of action, or returning some kind of result; this is an extremely important piece of information that the name of the function should not fail to clarify; therefore, the name of the function should indicate what kind of action is performed, or what kind of result is returned.
- Avoid conventions that make code look unnecessarily technical. Code is, by definition, already quite technical; we do not need to be making it look even more technical than it already is. Abandon the abhorrent practice of prefixing static variables with "s_", prefixing member variables with "m_", and prefixing private member variables with "_". Modern IDEs can be configured to provide sufficient visual clues about these things via syntax highlighting. If your IDE does not support this, throw it away and find one that does. If you are not using an IDE, then please switch to the arts and humanities.
Avoid Hungarian Notation (https://en.wikipedia.org/wiki/Hungarian_notation). For example, no matter how popular it is in the DotNet world, the practice of prefixing interface names with `I` is ill-conceived. What also helps in order to avoid Hungarian Notation is The Maximalistic Approach to Typing, where the nature of a variable is fully determined from its data type without the need for name adornments.
Which brings us to the next item:
- Use the type system to the fullest. Avoid using general purpose data types; try as much as possible to use data types that are specific for the job. A classic example of this is the use of a `Duration` data type instead of an `int` number of milliseconds, but it goes a lot further than that. So, no, your height is not of type `double`, it is of type `Length`; your married status is not a boolean, it is an instance of `MarriedStatus`; a customer id and a product id are not both of type `int`; one is of type `CustomerId`, while the other is of type `ProductId`; and so on. I call this The Maximalistic Approach To Typing. Untyped programming language aficionados can cry me a river.
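A minimal sketch of the Maximalistic Approach To Typing (hypothetical types): a `CustomerId` can no longer be accidentally passed where a `ProductId` is expected, which would be perfectly possible if both were plain `int`s.
```java
record CustomerId( int value ) { }
record ProductId( int value ) { }

interface OrderPlacer
{
    // Passing a ProductId where a CustomerId is expected is now a compile-time error.
    void placeOrder( CustomerId customerId, ProductId productId, int quantity );
}
```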
- Avoid defensive programming. Engage in offensive programming instead.
Defensive programming is summarized by Postel's law, otherwise known as the Robustness Principle, which says: Be conservative in what you do, be liberal in what you accept from others. (See https://en.wikipedia.org/wiki/Robustness_principle.) This principle suggests that besides producing output which adheres to the spec, our software should, as much as possible, be capable of coping with input that is off-spec. In other words, it should be tolerant to error. People imagine that when software behaves like that, it is more robust.
If there is one thing that I have learned in several decades of programming, both from my own code and from code written by others, it is that tolerance towards error leads to anything but bug-free software; it invariably results in chaos; and guess what chaotic software tends to be: buggy.
Read this: http://trevorjim.com/postels-law-is-not-for-you
So, instead of defensive programming, I advocate offensive programming, which means:
- Never allow any slack or leeway, require everything to be exactly as expected.
- Require adherence to the spec even if you have no use for the full precision mandated by the spec.
- Keep tolerances not just down to a minimum, but at absolute zero.
- Never fail silently; fail loudly instead. Fail fast; fail hard; fail eagerly, and enthusiastically.
Examples of offensive programming:
- Avoid conversion functions that return `null` if given `null`; always assert that the parameter is non-null. Better yet, avoid nullability altogether, or use a type system with explicit nullability, so as to restrict it via strong typing to only those places where it is meaningful. The same applies to empty strings: if an empty string is not meaningful somewhere, do not simply cope with it; explicitly and categorically disallow it.
- Avoid things like a `Map.put()` method which either adds or replaces, and instead design for an `add()` method which asserts that the item being added does not already exist, and a `replace()` method which asserts that the item being replaced does in fact already exist.
- In scenarios where an add-or-replace operation seems useful to have, (and in my experience, such scenarios are exceedingly rare,) add such a function but give it a name that clearly indicates the weirdness in what it does: call it `addOrReplace()`. (Duh!)
- Avoid things like a `close()` method which is allowed to be invoked more than once with no penalty: assert that your `close()` methods are invoked exactly once.
- Never use the garbage collector for cleanup; always perform explicit and deterministic clean-up at the exact moment when it is supposed to happen; the cleanup function invoked by the garbage collector should only be used for producing diagnostic messages in case we forgot to do explicit cleanup. Read this: michael.gr - Object Lifetime Awareness
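A minimal sketch of the `add()`/`replace()`/`addOrReplace()` and `close()` examples above (the class names are hypothetical):
```java
import java.util.HashMap;
import java.util.Map;

final class Index<K, V>
{
    private final Map<K, V> map = new HashMap<>();

    void add( K key, V value )          { V previous = map.put( key, value ); assert previous == null : key; }  // must not already exist
    void replace( K key, V value )      { V previous = map.put( key, value ); assert previous != null : key; }  // must already exist
    void addOrReplace( K key, V value ) { map.put( key, value ); }  // the rare case gets a name that admits what it does
}

final class Connection implements AutoCloseable
{
    private boolean closed;

    @Override public void close()
    {
        assert !closed : "close() invoked more than once";  // a second invocation is a bug in the caller; fail loudly
        closed = true;
        // ...release the underlying resource here...
    }
}
```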
- Use inheritance when it is clearly the right choice. The advice that composition should be favored over inheritance was very good advice in the mid-nineties, because back then people were overdoing it with inheritance: the general practice was to not even consider composition unless all attempts to get things to work with inheritance failed. That practice was bad, and the fact that the predominant language at that time (C++) supported not just inheritance but actually multiple inheritance made things even worse. So the advice against that practice was very much needed back then.
However, the advice is still being religiously followed to this day, as if inheritance had always been a bad thing. This is leading to unnecessarily convoluted designs and much weeping, and wailing, and gnashing of teeth. Even the original advice only suggested favoring one over the other; it did not prescribe the complete abolition of the other. So, today it is about time we reword the advice as follows:
Know when to use inheritance and when to use composition.
For a variety of opinions and a lengthy discussion about this, see
https://stackoverflow.com/q/49002/773113
Also heed the advice by Josh Bloch to design and document for inheritance or else prohibit it. (See https://blogs.oracle.com/javamagazine/post/java-inheritance-design-document)
- Favor early exits over deep nesting. This means liberal use of the `break` and `continue` keywords, as well as `return` statements in the middle of a method whenever possible. The code ends up being a lot simpler this way. Yes, this directly contradicts the ancient "one return statement per function" dogma. It is nice to contradict ancient dogma.
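A minimal sketch (hypothetical eligibility rules) of early exits: each guard clause returns immediately, so the happy path stays at the top nesting level.
```java
record Applicant( int age, boolean hasConsented, boolean isResident ) { }

final class Eligibility
{
    // Each guard clause exits immediately, so no nesting is needed.
    static boolean isEligible( Applicant applicant )
    {
        if( applicant.age() < 18 )
            return false;
        if( !applicant.hasConsented() )
            return false;
        return applicant.isResident();
    }
}
```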
- Avoid static mutable state like anthrax. Yes, this also includes stateful singletons. The fact that it only makes logical sense to have a single instance of a certain object in your world is no reason to design that object, and your world, so that only one instance of them can ever be. You see, I guarantee to you that the need will arise in the future, unbeknownst to you today, to multiply instantiate your world, along with that object in it, which you thought was one-of-a-kind.
As a matter of fact, it is quite likely that you will have to do that anyway, for the purpose of testing.
- Optimize performance bottlenecks, not performance penalties. The ages-old advice to avoid premature optimization is considered common knowledge, but it is a bit vague, so it does not actually register with many folks, who will not hesitate to optimize any code construct that they consider as representing a performance penalty, under the reasoning that if it represents a performance penalty then its optimization is not premature. For this reason, I like to rephrase the advice as "Optimize performance bottlenecks, not performance penalties" to stress the point that just because something represents a performance penalty, it does not mean that it should be optimized. You see, all code takes clock cycles to run, so every single little piece of code that we write represents a performance penalty; if that was sufficient reason to optimize it, then premature optimization would be the order of the day, every day. For something to be considered worthy of optimization, it should not merely represent a performance penalty; it should be proven to represent a performance bottleneck. You do not know whether something is a bottleneck unless you run the completed software system, discover that its performance is unacceptable, and use the profiler to determine exactly where the bottlenecks are. Also, what usually happens in these cases is that you tend to find some nice and formal algorithmic optimizations to apply in just a few places, and make your software meet its performance requirements, without having to go all over the entire source code base and tweak and hack things to squeeze clock cycles here and there.
- Put the tools of the trade into use. Armies of very good developers have worked hard to build these tools, don't you dare make their efforts go in vain.
- Use an IDE.
Programmers who think that they are better off with their favorite text editor and assortment of command line tools should be admitted to rehabilitation.
- Use the build feature of your IDE, which only compiles modified files.
Programmers who habitually perform a full rebuild instead of a plain build should be fired.
- Use your IDE for running tests.
Programmers who habitually run tests via separate tools outside of the IDE should be shot.
- The continuous build pipeline is your second line of defense, not your primary means of building and testing. Your IDE will always be a lot faster, and it has a built-in debugger.
- Use the debugger of your IDE as your first choice for troubleshooting anything, not as the last resort after all other options have been exhausted. This means that you should be using the debugger not only when there is trouble, but always, by default, so that it is ready when trouble occurs. This in turn means that when you want to launch your product, or to run the tests of your product, you should never hit the "Run" key on your IDE; you should hit the "Debug" key instead. Always the "Debug" key. Only the "Debug" key. You are a programmer; act like it.
- Having said all that, I should also add that people who are so attached to their IDE that they program by dragging and dropping code snippets around should perhaps consider that some desktop publishing job might suit them better.
- Do not even think that you are done with testing unless the code coverage tool gives you sufficient reason to believe so.
- Have your IDE perform code analysis on commit, and incorporate even more code analysis in the continuous build.
- Design with reliability as a foundation, not as an afterthought. For example, sharing data in a multi-threaded environment by means of traditional locking techniques ("synchronization") is both error-prone and untestable, because you cannot test for race conditions. Note that "error prone" and "untestable" is a deadly combination; therefore, this way of sharing data should be abandoned. Instead, design for a lock-free, share-nothing approach that works by passing immutable messages, thus eliminating the very possibility of race conditions.
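A minimal sketch (hypothetical worker) of the share-nothing approach: the only thing that crosses the thread boundary is an immutable message handed over via a queue, so application code holds no shared mutable state and does no locking of its own.
```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

record Job( String content ) { }  // the message is immutable

final class Worker
{
    private final BlockingQueue<Job> queue = new LinkedBlockingQueue<>();
    private final Thread thread = new Thread( this::run );

    void start() { thread.start(); }

    // Producers, on any thread, only ever hand over immutable messages.
    void post( Job job ) { queue.add( job ); }

    // All processing happens on this single thread, so there is no shared mutable state to race on.
    private void run()
    {
        try
        {
            for( ; ; )
            {
                Job job = queue.take();
                System.out.println( "processing " + job.content() );
            }
        }
        catch( InterruptedException e )
        {
            Thread.currentThread().interrupt();  // orderly shutdown on interrupt
        }
    }
}
```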
- Design with security as a foundation, not as an afterthought. Security is not something that you can add on top of an insecure foundation, because there exist no automated tests that can detect security hazards and no amount of carefulness on behalf of programmers that is careful enough. So, what is necessary is architectural choices that eliminate entire classes of security hazards. (Do not worry, there will always be other classes of security hazards to have to worry about.) If a certain architectural choice is prone to vulnerabilities, do not make that choice. An example of a vulnerability-prone architectural choice is putting application code on the web browser, otherwise known as full-stack development. Full-stack developers can cry me a river.
For more on this, read: michael.gr - What is wrong with Full Stack Development
- Keep the logs clean. Do not vex your colleagues, and do not make your own life harder, with torrential info-level or debug-level spam in the logs. Keep the info-level messages down to an absolute minimum, and once debugging is done, completely remove all the debug-level log statements. Utilize commit hooks that deliberately fail a commit if it contains debug-level logging statements. Regularly use the "blame" feature of the version control system to remind developers of info-level logging statements that they should remove. Never use the log for capturing metrics or any other kind of structured information; use some separate, specialized instrumentation facility for that.
- Make the best out of the log. You should at all times be able to click on a log line in the output window of the IDE and be taken to the source line that generated that log entry, and you should also at all times be able to click on any line of a logged exception stack trace and be taken to the corresponding line of source code. I am appalled by how many programming environments do not offer this as the default mode of operation under all circumstances.
- In the Microsoft Visual Studio world, for a line to be clickable in the output window it must start with a source pathname, followed by an opening parenthesis, a line number, a closing parenthesis, and a colon. It can optionally be prefixed with whitespace.
- Fortunately, both C++ and C# support efficient means of obtaining source file name and line number information: In C++ it is the `__FILE__` and `__LINE__` built-in pre-processor macros, while in C# it is the `CallerFilePath` and `CallerLineNumber` attributes.
- Unfortunately, the pathnames generated by these mechanisms are absolute, meaning that they start from the drive letter and include the kitchen sink, so you might want to programmatically convert them to pathnames relative to the solution folder before logging them. Visual Studio also recognizes those, though this is undocumented.
- In the JetBrains IntelliJ IDEA world, for a line to be clickable in the output window it needs to contain an identifier, followed by an opening parenthesis, a source filename-plus-extension (but no path), a colon, a line number, and a closing parenthesis.
- The identifier is meant to be a package name, but IDEA does not interpret it in any way, so it can be anything.
- Due to a long-standing bug (which JetBrains refuses to acknowledge or fix) if the word "at" appears in the log line, and if it is in any place other than immediately before the package name, then this mechanism breaks. (Note that this is all entirely undocumented.)
- Note that this mechanism suffers from ambiguity in the case of multiple source files with the same filename. An alternative mechanism is to include a "file://" URI in the log entry, but in order to produce such a URI you would have to figure out the path from the package name, which is doable, but not easy.
- Unfortunately, Java does not provide any efficient means of obtaining source file name and line number information, so one has to generate a stack trace in order to extract this information from it, as sketched below this list.
- Fortunately, generating a stack trace in the java world is not anywhere near as expensive as in the Microsoft world.
- Unfortunately, it is still unreasonably expensive. You can see this performance penalty as one more reason to keep logging to a minimum.
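Here is a rough sketch, in Java, of emitting log lines in the format described above; the `Log` class is hypothetical, the stack-trace index assumes a direct call into `info()`, and the exact format that IntelliJ IDEA recognizes may vary between versions:

```java
final class Log // hypothetical
{
    static void info( String message )
    {
        // Element 0 is typically Thread.getStackTrace() itself, element 1 is
        // this method, so element 2 is our caller.
        StackTraceElement caller = Thread.currentThread().getStackTrace()[2];

        // Produce "identifier(FileName.java:123) message", which the IntelliJ IDEA
        // console renders as a clickable link. Avoid the word "at" anywhere else
        // in the line, to steer clear of the bug mentioned above.
        System.out.println( caller.getClassName()
            + "(" + caller.getFileName() + ":" + caller.getLineNumber() + ") "
            + message );
    }
}
```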
- Take maxims with a grain of salt. (Especially quantitative maxims, which offer specific numerical limits for things.) When someone says "no function should accept more than 4 parameters" or "no class should be longer than 250 lines" they are usually talking nonsense.
A class should be as long as necessary to do its job, and if that is 2000 lines, so be it. I would much rather keep some ugly code confined in a single class than split it into multiple classes and thus propagate the ugliness in the design.
A function should accept as many parameters as necessary to do its job, and if that is 15 parameters, so be it. I would much rather have a long constructor than a mutable object.
Breaking things down to smaller units should be done because there is some actual tangible merit in doing so, not because some prophecy said so.
- Private static methods are fine. Really. An instance method has the entire object state at its disposal to read and manipulate, and this state may be altered by any other instance method, including instance methods that this method may invoke. The complexity of this is mind-boggling. A static method, on the other hand, is obviously not in a position to read or alter any of the object's state, and it is unable to invoke any instance methods that would do that. By its nature, a static method has to rely exclusively on its parameters, which are all clearly visible at each call site. Thus, a static method is an immensely less complex beast than an instance method. What this means is that private static methods are not the slightest bit evil, as some folks believe they are, and we should have more of them.
Personally, when I have a class that has both complex logic and mutable state, I tend to move the complex logic into private static methods, reducing the instance methods to doing nothing but invoking private static methods, passing instance fields to them and storing results into instance fields as necessary.
- Do not fix it unless there is a test for it. I do not yet have an opinion about test-driven development, but what I have found to be immensely useful is test-driven maintenance. So, when a bug is discovered, which obviously passed whatever automated tests you already had in place, do not hurry to figure out what causes it and fix it. First, write a test that tests for the bug, remaining completely agnostic of any theory that you might already have as to what is causing it. This test should initially fail; if it does not fail, then the bug is not what you think it is, so you have more research to do. If the test fails as it should, then fix the bug according to your theory as to what is causing it. If the test now passes, then your theory was correct. If not, then not only have you not fixed the bug, but you have probably broken something else which used to be fine.
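Here is a minimal sketch, in Java with JUnit 5, of what this workflow looks like; the `PriceFormatter` class and the bug in it are hypothetical, invented purely for illustration:

```java
import static org.junit.jupiter.api.Assertions.assertEquals;
import org.junit.jupiter.api.Test;

// Hypothetical component with a reported bug: negative amounts allegedly
// lose their minus sign when formatted.
class PriceFormatter
{
    static String format( double amount )
    {
        return String.format( "%.2f", Math.abs( amount ) ); // the (hypothetical) bug
    }
}

class PriceFormatterTest
{
    // This test is written first, before any attempt to fix the bug, and it must
    // initially fail; if it passes right away, the bug is not what we think it is.
    @Test
    void negativeAmountsKeepTheirSign()
    {
        assertEquals( "-1.50", PriceFormatter.format( -1.50 ) );
    }
}
```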
- Avoid death by ten thousand little methods. Again and again I see code bases with multitudes of tiny methods having cryptic names, each containing just one or two lines of trivial code, aiming to ensure that not a single line of code is duplicated anywhere. The downside of this is that it increases the complexity of the call tree and therefore the amount of mental effort required to make sense out of it. A new function is worth introducing if it has a well-defined, meaningful role to play. Difficulty in coming up with a name for a function, or having many functions with names that differ only slightly and fail to readily convey the difference between them, are both good indicators that these functions have no role to play other than to avoid code duplication. Of course there is merit in reducing code duplication, but not when the code in question is trivial. And when you see the possibility to de-duplicate non-trivial code, then the well-defined, meaningful role of the function tends to be immediately obvious, as well as the appropriate name for it.
- Make the best out of break-on-exception. Set up your development tooling, and use whatever runtime mechanisms are necessary, so that the debugger always stops at any statement that throws an unexpected exception.
- Many programmers have the bad habit of doing all their troubleshooting by examining logs and postmortem stack traces and theorizing as to what went wrong, instead of having the debugger break on exception and actually seeing what went wrong. This is extremely counter-productive.
- Unfortunately, exceptions are a somewhat complex topic, programming languages and their run-times behave in complex ways when exceptions are thrown, and debuggers have complex mechanisms for dealing with them, none of which helps. As if that was not enough, it is not always easy to tell when a certain exception should be expected and when it should not be expected.
- Thus, there exist several obstacles to accomplishing proper, usable, break-on-exception:
- Our code throws and catches expected exceptions all the time, or uses external libraries that do so, internally, all the time; clearly, we do not want the debugger to stop on any of those.
- One might think that the solution to this problem would be to configure the debugger to ignore caught exceptions and only stop on uncaught exceptions; unfortunately, that will not work either, because quite often we have exceptions that we consider as uncaught, but technically they are caught; for example:
- An external library invokes our code, and our code throws an exception, which is uncaught as far as our code is concerned, but it is caught by the external library. A typical example of this is event-driven frameworks, i.e. virtually all GUI frameworks, which invoke our code to handle events, and almost always do so from within a try-catch block. Thus, any exception thrown by our event handlers is actually a caught exception, and the debugger will not stop on it.
- In many languages, the `try-finally` clause internally catches exceptions and re-throws them at the end of `finally`, meaning that any exception thrown within the `try` block is technically a caught exception. Thus, a debugger configured to stop on uncaught exceptions will break at the end of the `finally` block, which is completely useless and counter-productive. The same problem is encountered with other constructs which are internally implemented using `try-finally`, such as the synchronization clause, the automatic disposal clause, etc.
- To complicate matters even further, an exception which is unexpected and unhandled under normal circumstances may temporarily become expected and handled during testing. This happens when a test deliberately causes a malfunction to ensure that the component-under-test detects it and responds by throwing an exception, which is then caught by the test and examined to ensure that it is the correct exception and that it has been correctly filled in; when this happens, we do not want the debugger to stop, because we do not want our tests to be interrupted by the debugger while everything is proceeding according to plan. (A sketch of such a test follows below this list.)
- Here is a StackOverflow question and answer which simplifies things a lot: https://stackoverflow.com/q/71115356/773113
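As an illustration of the point above, here is what a test that deliberately expects an exception tends to look like (JUnit 5 assumed; the parsing call is just an arbitrary example of something that throws):

```java
import static org.junit.jupiter.api.Assertions.assertNotNull;
import static org.junit.jupiter.api.Assertions.assertThrows;
import org.junit.jupiter.api.Test;

class MalfunctionTest
{
    // While this test runs, the exception is expected: assertThrows() catches it
    // and hands it to us for examination, so the debugger must not break on it,
    // even though the very same exception would be unexpected in production.
    @Test
    void malformedInputIsRejected()
    {
        NumberFormatException exception =
            assertThrows( NumberFormatException.class, () -> Integer.parseInt( "not-a-number" ) );
        assertNotNull( exception.getMessage() ); // verify that it was correctly filled in
    }
}
```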
- Write code as if it will be reviewed by someone, even if it never will. Always try to take one more look at the code from a completely agnostic point of view, supposing that you know nothing about what it does, why it does it, how it does it. Does the code still make sense? Is everything obvious? If not, refactor it until it is as plain as daylight. If comments are necessary to explain what is going on, can the code be refactored so that the comments become unnecessary? Which brings us to the next point.
- Avoid writing code comments. Never add a comment in the code unless absolutely necessary. (Note that this applies to code comments, not to public interface comments, which are sometimes nice to have.)
- The purpose of a code comment should be to alert the reader that something special is happening here, which is not obvious, and cannot be explained by any means other than written prose. This should only be necessary in exceptional situations, while the norm should be that the code is always so simple, and so self-explanatory, that no comments are necessary.
- Code comments that simply state the obvious are unwarranted causes of alert, and if you repeat them enough they will force the reader to start treating your comments as noise, and may thus cause the reader to miss that rare comment which was actually important to note.
- Comments tend to be necessary when a piece of code does something unexpected, which is usually code that takes special measures to circumvent some anomalous behavior of some other code. In these cases, explaining what the code does is not even the goal; the goal is to explain why it does it, and in order to explain that you have to describe the anomalous behavior, which may even necessitate listing various usage scenarios that have been tried and results that have been observed. This in turn means that comments worth writing tend to be entire multi-paragraph-long essays explaining strange and complicated situations. In my experience, one-liners are usually of no value.
- Note that when documenting code that circumvents anomalous behavior it is a good idea to assert, if possible, that the anomalous behavior is in fact still present, so that if it gets fixed in a later version, you will take notice so you can remove the code that circumvents it.
- If you find yourself adding a code comment, first ask yourself whether there is any way to restructure the code so that the comment is unnecessary. There are many ways to do this, the simplest one being to extract the code into a separate function that has a self-explanatory name. However, the best way to do it is to restructure the code so that even the explanatory name becomes unnecessary. For example, in old C code you might come across a pointer-returning function whose documentation says that the caller is responsible for freeing the pointer. This is deplorable. Do whatever it takes to avoid this; use a callback, use an allocator parameter, have the caller supply the memory, throw it all away and rewrite it in Java, anything but requiring people to read comments or else they get punished with memory leaks.
- If a comment can be coded as an assertion statement, so much the better. Comments saying "x must be greater than y here" are pointless. Assert the darn thing, and spare us the comment, or perhaps use a comment to explain the why, but not the what. The assertion takes care of the what, and it does so unambiguously and definitively, because it compiles and passes the tests, which is something that no comment will ever do. (A sketch follows at the end of this list.)
- If you modify some code, and there is a comment attached to that code, do not forget to do something about the comment: Ideally, your modifications should make the comment redundant, so you should remove it. If not, then at least make sure that the comment is still valid after the modifications. Unfortunately, programmers often leave comments unchanged while changing the code around them, thus making every single comment in the entire code base liable to devolving into being inaccurate, or even misleading, and thus constituting an instance of sabotage. This is happening because:
- Programmers treat comments as noise, and therefore do not even notice their presence. (This is why comments should be used very rarely, in exceptional situations only.)
- Comments are poorly written, so programmers do not understand them. When a programmer does not understand a comment, then obviously they cannot modify it, but it gets even worse: they do not dare to remove it either, because they assume that it must have some special meaning to some other programmer. Thus, poorly written comments are very similar to Persistent Organic Pollutants (W) a.k.a. forever chemicals: once created, they will be causing harm for all eternity.
If a comment does not make sense to you, then find the author, ask them what it means, and update it accordingly. If the author is not around anymore, then ask any other experienced programmer in the shop. If they cannot tell what it means either, then trust me, this comment will never make sense to anyone, so go ahead and remove it.
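To revisit the earlier point about coding comments as assertions, here is a trivial sketch; the method and its parameters are hypothetical, and remember that on the JVM assertions must be enabled with `-ea` on test runs:

```java
class Recalculation // hypothetical
{
    void recalculate( int x, int y )
    {
        // Instead of a comment saying "x must be greater than y here", state the
        // fact as an assertion: the assertion documents the 'what' and is verified
        // on every test run; a comment, if still needed, only explains the 'why'.
        assert x > y : "x=" + x + ", y=" + y;
        // ...
    }
}
```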
- If you must write doc-comments, make them good. Ideally, an entity (class or method) should have a well-chosen name and a very simple and straightforward interface or prototype, so that everything is clear at a glance, and therefore no doc-comment is needed. If things are not so simple, then it may be necessary to clarify them with a doc-comment.
- A doc-comment must be as simple and as brief as possible.
- Do not try to follow templates, or if you do, then treat all template fields as optional: skip any information that is not strictly speaking necessary.
- Some bureaucratic documentation guidelines require the doc-comment of a function to follow a specific template which begins with a summary line, is followed by one line for each parameter, and includes one line for the return value. If your function really needs all this information to be explained in a doc comment, then your function must be doing something extremely bizarre. If your function is not doing anything bizarre, then a single summary line might suffice to explain what it does; if so, then skip the extra lines explaining each parameter, as well as the extra line explaining the return value.
- As an example of what to avoid, see the `IEnumerable<T>.GetEnumerator()` method of C#/dotnet. The doc comment says:
- Returns an enumerator that iterates through the collection.
- Returns:
- An enumerator that can be used to iterate through the collection.
As you can see, the documentation is repeating itself. This is wasting the time of anyone attempting to read this documentation. This is annoying. Do not do this.
- A doc-comment is a public interface comment, not an implementation comment. As such, a doc-comment on an entity should explain, in the most brief and abstract terms possible, the following:
- What task it accomplishes.
- What input it accepts.
- What output it produces.
Note that it does not need to address each one of those items separately; a doc-comment on a method which simply says that it "sorts a file in-place" explains all three items in one go. (A sketch of this example follows at the end of this list.)
- A documentation comment should not make the slightest attempt to explain any of the following:
- How the task is accomplished.
- Which entities are expected to invoke the entity.
- Which entities are invoked by the entity.
- The above points are important to state because many misguided practices from the infancy of our discipline have it all wrong by stipulating that documentation comments should include preposterous things such as who invokes whom, completely missing the whole point behind the notion of general-purpose, reusable software and even the fundamental notion of abstraction.
- If you are asking "but shouldn't documentation describe the how?" the answer is no, that's what we write code for. By definition, the only authoritative source of information as to how something is done is the code that does it. As I have already explained, the code must be so simple and so easy to read that English-language prose on top of it should be bringing no added value. As a matter of fact, the presence of prose is dangerous, because quite often people modify the code without bothering to also modify the documentation, which leads to situations where the documentation is misleading.
- If, after looking at the code, something is still unclear, then place a breakpoint and launch the tests; the debugger will make things pretty clear to you.
- If you are wondering how the code works under a case which is not covered by the tests, then fix this by adding a test for that case! (Duh!) Also note that even if there was a "how" section in the doc-comment, it probably would not have covered that special case anyway.
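Here is a sketch of the "sorts a file in-place" example from above; the interface is hypothetical:

```java
import java.nio.file.Path;

interface FileSorter // hypothetical
{
    // Avoid a template-driven doc-comment that merely restates the prototype:
    //
    //     /**
    //      * Sorts a file.
    //      * @param file the file to sort.
    //      */
    //
    // One line that states the task, the input, and the output in one go suffices:

    /**
     * Sorts the given file in-place.
     */
    void sort( Path file );
}
```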
- Always maintain provenance.
- When you copy some code from Stack Overflow, and it is not immediately obvious how the code works, (let alone why the code works,) always add a comment containing a link to the original question-and-answer. Of course this is not necessary if the code that you copied is something fairly standard, like reversing a string; but if the code is anything but standard, (do you have any idea what it takes in Microsoft Windows to have a progress dialog shown while copying files?) then citing your sources is an absolute must.
- When performing a calculation which involves a certain fixed value, it goes without saying that in almost all cases you will not hard-code a magic number in the calculation; instead, you will declare a constant with a nice descriptive name for that value, and use the constant in the calculation. Note that this must be done even in fairly trivial cases, for example `const int BitsPerByte = 8;`, and can only be skipped in a very small number of utterly trivial cases, for example when directly multiplying something by 2 in order to double it, instead of first having to declare a constant for 2.
- Furthermore, in many cases, declaring a constant is inadequate if it still begs the question: why this particular value and not some other value? So, unless the value in question is a Fundamental Constant (W) (e.g. `const double SpeedOfLight = 299792458.0`) you also have to add a comment to the constant explaining exactly why this particular value was chosen, or where it came from. So, if you need to use the population of Mexico City in a calculation, `const int MexicoCityPopulation = 9209944;` is not enough; it must be followed by a comment saying `// 2020 data, from https://en.wikipedia.org/wiki/Mexico_City`.
- Stick with UTC everywhere. Use UTC and only UTC for any purpose that involves storing, retrieving, communicating, converting, calculating, and doing really anything whatsoever with time, except for the following two cases, and only the following two cases:
- Parsing a string that was entered by the user into a UTC time variable.
- Converting a UTC time variable to a string to be shown to the user.
However: When dealing with events that happen in the future, make sure to also store the targeted time-zone along with the UTC coordinate, because every few years various countries around the world decide to change their daylight savings policy, which means that the mapping from UTC to local time may change, and you have no way of knowing that in advance.
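Here is a minimal sketch, in Java, using `java.time`; the `ScheduledEvent` record and the display format are hypothetical:

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;

// A future event: the UTC coordinate, plus the time-zone it was scheduled in,
// so that local wall-clock time can be recomputed if that zone's rules change.
record ScheduledEvent( Instant utcInstant, ZoneId zone )
{
    private static final DateTimeFormatter FORMAT = DateTimeFormatter.ofPattern( "yyyy-MM-dd HH:mm" );

    // Parsing user input is one of the only two places where local time appears.
    static ScheduledEvent parse( String localText, ZoneId zone )
    {
        ZonedDateTime local = LocalDateTime.parse( localText, FORMAT ).atZone( zone );
        return new ScheduledEvent( local.toInstant(), zone );
    }

    // Converting back to local time for display to the user is the other one.
    String displayText()
    {
        return FORMAT.format( utcInstant.atZone( zone ) );
    }
}
```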
- Keep technical implementation concerns separate from application concerns. Application code should not be making assumptions about the technical details of the system, so that the technical details are free to change with minimal changes to application code, and vice versa. For example, the multi-threading regime under which a system operates (whether the system utilizes a single thread, or multiple discrete threads, or a thread-pool) is a technical implementation concern. As such, all knowledge of how multi-threading is done should be isolated in the relatively small body of code which wires up (realizes) the system, and all application code should be capable of operating regardless of the multi-threading regime. Incidentally, this facilitates running tests under a strictly single-threaded regime, to ease debugging. Async/await aficionados can cry me a river.
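Here is a minimal sketch, in Java, of what this separation can look like; the class names are hypothetical:

```java
import java.util.concurrent.Executor;
import java.util.concurrent.Executors;

// Application code: depends only on the abstract Executor interface and knows
// nothing about how many threads exist, if any.
class OrderProcessor // hypothetical
{
    private final Executor executor;

    OrderProcessor( Executor executor ) { this.executor = executor; }

    void submit( Runnable order ) { executor.execute( order ); }
}

// Wiring code: the only place where the multi-threading regime is decided.
class Wiring
{
    static OrderProcessor forProduction()
    {
        return new OrderProcessor( Executors.newFixedThreadPool( 8 ) );
    }

    static OrderProcessor forTesting()
    {
        // Strictly single-threaded: everything runs on the caller's thread,
        // which makes stepping through the code with the debugger trivial.
        return new OrderProcessor( Runnable::run );
    }
}
```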
- Maximize the consistency of code formatting. I would be tempted to say "format code with absolute consistency", but I cannot, because we usually lack the tools to achieve this, so the goal is to strive to get as close as possible to achieving absolute formatting consistency.
In the preface of the very highly acclaimed book "Clean Code" by Robert C. Martin the author mentions some experimental findings indicating that "consistent indentation style was one of the most statistically significant indicators of low bug density." The author also states that "style distinguishes excellence from mere competence", which I think is an excellent observation; however, the conclusion at which the author arrives is unwarranted, because correlation does not imply causation: it is probably not the consistent indentation style that causes fewer bugs, it is the kind of mindset of programmers who strive for a consistent indentation style which also happens to be the kind of mindset that produces fewer bugs. Be the programmer who has that mindset.
If you are one of those programmers who do not particularly care for consistent formatting, I know what you are thinking right now: you are thinking that you are the rare exception to the rule, and that your code is of course awesome and bug-free despite looking sloppy; well, you have every right to think in any way you like about yourself, but I hope you understand that nobody else will be particularly willing to give you the benefit of the doubt.
Note that this does not mean that every programmer must be forced to follow a specific set of formatting guidelines; on the contrary: by using tools to do the formatting for us, we do not have to worry about formatting. The corollary to this is that as an employer, the only kind of code formatting that you have the right to require from programmers is that which can be achieved by means of automatic code reformatting tools that you already have in place.
The point to take home from all this is that the formatting style must be specified in the highest detail possible, the tools must be painstakingly configured to reformat code according to that style, and the guidelines of how to work around limitations of the tools must be laid down and agreed upon before any work is done on a software project, no matter how much effort all of this represents.
- Use tight abstractions. In other words, avoid leaky abstractions.
Joel Spolsky's original 2002 article which formulated the "Law of Leaky Abstractions" (https://www.joelonsoftware.com/2002/11/11/the-law-of-leaky-abstractions/) stating that "All non-trivial abstractions, to some degree, are leaky" focused on examples where implementation details of the underlying layer are exposed not by the interface itself, but by observing the performance characteristics of the underlying layer. For example, the interface of two-dimensional arrays is generic enough to allow us to iterate over them either row-first or column-first without having to know their internal memory layout; however, which way we choose can have drastic performance implications, due to memory cache utilization. This means that we do of course have to keep in mind the technicalities of the layer which implements the abstraction; it does not, however, mean that the interface should be compromised in any way.
More often than not, in our daily jobs we have the misfortune of dealing with abstractions that are leaky at the interface level.
- A glaring example of this, in languages like C# and Java, is `class Object`, whose public interface contains a hash-code function, which is entirely out-of-place and unwarranted, because it has to do with an implementation detail of hash-maps. This mishap could have been avoided in a number of different ways, for example:
- Require the programmer to supply, upon hash-map construction, the hashing function to use.
- Require objects intended to be used as keys in a hash-map to implement a `Hashable` interface which defines a hash-function.
Unfortunately, neither of these approaches was chosen, either by Java or by C#, due to some misguided notion of convenience. Instead, the implementation detail of hash-maps, namely that they need a hash function to work with, has leaked into `Object`, requiring every single class to have a hash-code method, even if the class will never be used as a key in a hash-map, and even if the class could never conceivably be used as a key in a hash-map, due to, for example, it being mutable. (Sketches of both alternatives follow below this list.)
- Another example is serialization frameworks that leak details about the underlying file format that they work with: every single XML or JSON serialization framework that I have come across does that, so it is specifically a JSON serialization framework, or an XML serialization framework, but not a general-purpose serialization framework.
A proper general-purpose serialization framework would expose no file format details in its interface, thus being replaceable with a different implementation which serializes to and from some other file format, without any changes necessary to the code that uses the framework. I have written such a framework, and I assure you it was not easy, but here is the thing: Doing it right ™ is never easy.
Leaky abstractions are the source of untold suffering in software development, and they must be avoided at all costs. Creating air-tight abstractions is often omitted in the interest of saving time, and people make do with leaky abstractions instead, but this invariably results in orders of magnitude more time wasted over the long run in dealing with the disastrous consequences of the leaky abstractions.
I would dare to propose that the term abstraction has (or ought to have) an inherent notion of absoluteness; just as one can be either pregnant or non-pregnant but not slightly pregnant or almost pregnant, so can an interface either be an abstraction or not an abstraction; it cannot be somewhere in-between. Thus, an incomplete or leaky abstraction should, for all practical purposes, be regarded as not an abstraction. (Because that's what the almost absolute is: non-absolute.)
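Here is a rough sketch, in Java, of what the two alternatives mentioned above could have looked like; neither exists in the JDK, so all names here are hypothetical:

```java
import java.util.function.BiPredicate;
import java.util.function.ToIntFunction;

// First alternative: the map is handed its hashing (and equality) strategy upon
// construction, so keys need not expose a hash-code method in their interface.
class StrategyHashMap<K, V> // hypothetical
{
    private final ToIntFunction<K> hasher;
    private final BiPredicate<K, K> equality;

    StrategyHashMap( ToIntFunction<K> hasher, BiPredicate<K, K> equality )
    {
        this.hasher = hasher;
        this.equality = equality;
    }
    // ...bucket management omitted; only the construction-time contract matters here.
}

// Second alternative: only classes that are actually meant to be used as
// hash-map keys implement this interface.
interface Hashable
{
    int hashValue();
}
```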
- Thoroughly emulate any and all hardware. Hardware emulation is a special case of abstraction, where instead of abstracting software we are abstracting hardware. Incomplete hardware emulations are a curse for the same reasons that leaky abstractions are a curse. Hardware emulations must be 100% complete so that any software performing high level operations with the hardware can make use of all of the functionality of the hardware while remaining completely agnostic of whether it is connected to the real hardware or to an emulation thereof.
- Only use absolute file-system paths. All file-system paths must be absolute. It is fine to provide the user with the convenience of entering a relative path, but the relative path must be converted to absolute immediately upon entering the system. Relative paths are based on the notion of a "current directory", which is one of the most ill-conceived, misused, and treacherous notions in the history of programming, because it is a global mutable variable. (I hope I do not need to explain why a global mutable variable is evil, right?) Note that the "current directory" is global not only across all classes of your application, but also across all threads of your application, and, in DotNet, even global across all AppDomains of your application, which were supposed to be completely isolated. Duh!? What were they thinking?
- Avoid GUIDs (also known as UUIDs). Never use GUIDs if you can avoid them. If you must use them, then make sure they are an implementation detail and that they constitute a side-note of your design, not a predominant feature of your design. Read this:
michael.gr - What is wrong with UUIDs and GUIDs.
- Do It Right ™. Avoid taking shortcuts in the name of immediate savings now but at the expense of headaches later, because the later headaches invariably end up costing orders of magnitude more than the immediate savings. When some colleague, manager, or decision-maker suggests that we "make it simple now, and worry about making it right later" they are imagining that they are being smart and helping optimize things, while in fact they are being a smart-ass, and they are suggesting that a technical crime be committed.
An example of this, which has already been mentioned, is finding proper names for identifiers. If you want to introduce a new identifier, finding a proper name for it may require opening up the thesaurus, spending a considerable amount of time creating a list of candidate words, opening up the dictionary, looking up the exact meaning of each candidate word, applying the process of elimination, etc. So, you can save lots of time right now by skipping all this and simply calling it something meaningless, or worse yet, something inaccurate and therefore misleading. It is a fact that you will indeed experience immediate time savings right now if you do this. However, it is also a fact that the time you save now by performing this act of sabotage against yourself will invariably be paid a hundredfold later, when you and your coworkers will be wondering what on earth was meant by this meaningless name, or struggling with the realization that it is being used in the code in ways that are in conflict with its meaning.
Of course, Do It Right ™ does not apply only to naming, it applies to everything. And when I say everything, I mean E V E R Y T H I N G. The practice of Do It Right ™ must be a conditioned reflex; it must be the default, reliable, fail-safe, look-no-further choice that we always make, without spending time calculating the costs vs. savings of Do it Right ™, debating whether we should Do It Right ™ or not Do It Right ™, etc. The term Do It Right ™ contains within its name the reason why we should Do It Right ™.