Name For Functionality, Not Type

I just read a blog post by Michel Fortin, in which he quotes Joel on Software regarding Hungarian notation, or rather, Hungarian WartHogs. Naming a variable for its type, or a type for its location or namespace, is a mistake.

I agree with Joel on his introduction: there are different levels of programmers and, at some point, your nose simply starts to itch when you see code that looks OK, but really isn’t. More than once (and I have witnesses to this fact) I have repaired bugs that we knew existed, but didn’t know where they were, simply by fixing a piece of code that didn’t “feel” right. For a few months that was a full-time job for me, in fact: I was to look over the shoulders of programmers debugging things and fix their bugs for them. Though I was really good at it, it’s not a great job to have to do every day.

So, I agree that at some point, you start having an idea of what clean code should feel like, and you start trying to explain that to other people. If you’re coding in K&R C, then the original Hungarian Notation that Joel talks about may be a good path to go on. However, if you’re coding in a type-safe language, such as C99 or C++, Hungarian notation, whether it be the app-style or the system-style, is simply a mistake – and a very bad one.

In case Joel reads this: no, I don’t think exceptions are the best invention since chocolate milkshake – and I don’t particularly like chocolate milkshake either. I don’t passionately hate Hungarian notation. I do think, however, that Hungarian notation is a mistake and that if you think you need it, there’s something you are doing wrong.

The Example

Joel gave us an example to get rid of cross-site scripting. I agree cross-site scripting is a problem, but it is a problem only if you don’t obey the rule that you should check what comes into your program with run-time checks – always. Anything you read from a file, a connection, a console, the command-line or any other place where a human being could possibly give you any kind of input, should be considered dirty until cleaned, and should be cleaned as soon as possible. You don’t need any special notation for this (such as us for unsafe string and s for safe string). In fact, it is a mistake to do that because your name will lie to you. Consider the following code:

s = Request("name")
Write "Hello, " & Request("name")

which Joel “corrected” into

s = Request("name")
Write "Hello, " & Encode(Request("name"))

We agree on the problem with the first version of the code: it is vulnerable to cross-site scripting. We don’t agree on the solution, which is to encode the string when it is used. IMHO, the solution should be to make sure the string is never in memory in an unsafe form, or at least for as short a period as possible. That is, if there is no way to make sure that Request("name") returns an encoded (clean) string, the code should be

s = Encode(Request("name"))

Joel proposed this solution but rejected it because you might want to store the user’s input in a database. He’s right on that point – he’s also right to reject his second proposed solution, which is to encode anything that gets output to the HTML. His “real” solution is still wrong, however: the first proposed solution just needs a tweak.

What you need, in this case, is a way to capture your user’s input, clean it and get it in a format that you can meaningfully store in a database and output back to the screen. IMHO, the best way to do that is to use a reversible clean-up method that puts the string in an intermediate form that you can store in the database, and from which you can convert it for safe output to HTML. The intermediate form should be easily recognizable for debugging purposes. I usually use Base64 for this. That way, if you forget to convert from your intermediate form, you are not vulnerable to XSS but you have a (clearly visible) bug. Your database isn’t vulnerable to XSS either, and you don’t need an extra way to make sure of that. Using Base64 makes the clean-up completely reversible. I concede that this is rather crude. The point, though, is that crude as it is, it means you don’t have to rely on style for the security of the application. Refining the method, by wrapping it in an object type of some kind, for example, is straightforward and comes with more advantages – and very few disadvantages.
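To make the idea concrete, here is a minimal sketch in C++ of what such an intermediate form could look like. The function names (toStorageForm, fromStorageForm, escapeHtml, htmlFromStorageForm) are invented for this illustration, and both the hand-rolled Base64 codec and the escaping step are assumptions; in real code you would probably use a vetted library for each.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Standard Base64 alphabet (RFC 4648). None of these characters, nor the
// '=' padding, has a special meaning in HTML.
const std::string kAlphabet =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

// Clean-up step: turn raw user input into the form kept in memory and in the database.
std::string toStorageForm(const std::string& userInput) {
    std::string stored;
    for (std::size_t i = 0; i < userInput.size(); i += 3) {
        const std::size_t groupSize = std::min<std::size_t>(3, userInput.size() - i);
        std::uint32_t bits = 0;
        for (std::size_t j = 0; j < 3; ++j) {
            bits <<= 8;
            if (j < groupSize) bits |= static_cast<unsigned char>(userInput[i + j]);
        }
        // groupSize bytes give groupSize + 1 significant characters; the rest is padding.
        for (std::size_t j = 0; j < 4; ++j)
            stored += (j <= groupSize) ? kAlphabet[(bits >> (18 - 6 * j)) & 63] : '=';
    }
    return stored;
}

// Reverse step: recover exactly what the user typed.
std::string fromStorageForm(const std::string& stored) {
    if (stored.size() % 4 != 0) throw std::invalid_argument("not Base64: bad length");
    std::string userInput;
    for (std::size_t i = 0; i < stored.size(); i += 4) {
        std::uint32_t bits = 0;
        std::size_t padding = 0;
        for (std::size_t j = 0; j < 4; ++j) {
            const char c = stored[i + j];
            bits <<= 6;
            if (c == '=') { ++padding; continue; }
            const std::size_t value = kAlphabet.find(c);
            if (value == std::string::npos || padding != 0)
                throw std::invalid_argument("not Base64: bad character");
            bits |= static_cast<std::uint32_t>(value);
        }
        if (padding > 2 || (padding != 0 && i + 4 != stored.size()))
            throw std::invalid_argument("not Base64: bad padding");
        for (std::size_t j = 0; j + padding < 3; ++j)
            userInput += static_cast<char>((bits >> (16 - 8 * j)) & 0xFF);
    }
    return userInput;
}

// Simplistic HTML escaping, assumed here only to complete the example.
std::string escapeHtml(const std::string& text) {
    std::string html;
    for (char c : text) {
        switch (c) {
            case '&':  html += "&amp;";  break;
            case '<':  html += "&lt;";   break;
            case '>':  html += "&gt;";   break;
            case '"':  html += "&quot;"; break;
            case '\'': html += "&#39;";  break;
            default:   html += c;        break;
        }
    }
    return html;
}

// The only path from the stored form to the page.
std::string htmlFromStorageForm(const std::string& stored) {
    return escapeHtml(fromStorageForm(stored));
}
```

If someone forgets the last step and writes the stored form straight to the page, the visitor sees something like PHNjcmlwdD4= instead of a running script: an obvious bug, but not a security hole.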

The Fragility of Hungarian Notation

Hungarian notation is fragile: you have to rely on the names of your variables to tell you something about their type. Even in the original Hungarian notation, there was no functionality-related information so Joel’s “us”, which contains an unsafe string, could be an unsafe string meaning absolutely anything. But that is not the only problem. Hungarian notation makes your code lie to you. Consider the following code:

us = UsRequest("name")
usName = us
recordset("usName") = usName
 
 
 
 
' much later
sName = SFromUs(recordset("usName"))
WriteS sName

which according to Joel is just dandy. That’s nice, until another programmer comes along and adds a second field, the address:

us = UsRequest("name")
usName = us
us = UsRequest("address")
usAddress = us
recordset("usName") = usName
recordset("usAddress") = usAddress
 
 
 
 
' much later
sName = SFromUs(recordset("usName"))
sAddress = SFromUs(recordset("usAddress"))
WriteS sName
WriteS sAddress

which is fine and dandy as well, but let’s say someone introduces SRequest, which for some reason is more efficient than UsRequest and returns safe strings. The code is changed (under pressure) into this:

us = SRequest("name")
usName = us
us = UsRequest("address")
usAddress = us
recordset("usName") = usName
recordset("usAddress") = usAddress
 
 
 
 
' much later
sName = recordset("usName")
sAddress = recordset("usAddress")
WriteS sName
WriteS sAddress

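which means most of the code now lies to you: usName and recordset("usName") hold a safe string, while sAddress holds an unsafe one that WriteS sends out unencoded.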

The code presented here is trivial and it is unlikely that this specific scenario will occur. However, scenarios like this occur every day, and more and more code is changed to lie to the reader.

You need a style that doesn’t let your code lie to you – and Hungarian notation doesn’t qualify.
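As a rough illustration, here is the name example again as a C++ sketch, assuming a language whose type system can carry the distinction. The parameter is named for what it is for (customerName), and whether the text has been cleaned is carried by its type rather than by a prefix. UserInput, HtmlText and greeting are hypothetical names, and the escaping is deliberately simplistic.

```cpp
#include <cassert>
#include <string>
#include <type_traits>
#include <utility>

// Text exactly as the user supplied it. Hypothetical wrapper for illustration.
class UserInput {
public:
    explicit UserInput(std::string text) : text_(std::move(text)) {}
    const std::string& text() const { return text_; }
private:
    std::string text_;
};

// Text that is safe to embed in HTML. The only way to obtain one is to
// escape a UserInput, so an unescaped string cannot pass for one.
class HtmlText {
public:
    explicit HtmlText(const UserInput& input) {
        for (char c : input.text()) {
            switch (c) {
                case '&':  markup_ += "&amp;";  break;
                case '<':  markup_ += "&lt;";   break;
                case '>':  markup_ += "&gt;";   break;
                case '"':  markup_ += "&quot;"; break;
                case '\'': markup_ += "&#39;";  break;
                default:   markup_ += c;        break;
            }
        }
    }
    const std::string& markup() const { return markup_; }
private:
    std::string markup_;
};

// The name says what the value is for; the type says what state it is in.
std::string greeting(const HtmlText& customerName) {
    return "Hello, " + customerName.markup();
}

// Neither raw strings nor unescaped input convert to HtmlText implicitly.
static_assert(!std::is_convertible<UserInput, HtmlText>::value,
              "escaping must be an explicit step");
static_assert(!std::is_convertible<std::string, HtmlText>::value,
              "a raw string is not HTML");
```

If the function that produces the value is swapped for one that returns a different kind of text, the call sites stop compiling, and no variable has to be renamed to keep the code honest.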

Just one more example to drive the point home: in C and C++, the _t suffix traditionally implies that the name denotes a typedef.

What is wchar_t?

In C, it is a typedef.

In C++, it is a built-in type, and the name lies about it.
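A small C++ sketch shows the difference. Because wchar_t is a distinct built-in type in standard C++, it can be overloaded alongside the integer types without clashing with any of them, which would not be possible if it were a typedef for one of them. The function name kindOf is hypothetical and exists only for this illustration; the sketch assumes a conforming compiler.

```cpp
#include <cassert>
#include <string>
#include <type_traits>

// In standard C++, wchar_t is not an alias for any of these integer types,
// despite what the _t suffix suggests.
static_assert(!std::is_same<wchar_t, unsigned short>::value, "wchar_t is its own type");
static_assert(!std::is_same<wchar_t, int>::value, "wchar_t is its own type");
static_assert(!std::is_same<wchar_t, unsigned int>::value, "wchar_t is its own type");
static_assert(!std::is_same<wchar_t, long>::value, "wchar_t is its own type");

// These overloads can coexist only because wchar_t is a separate type.
std::string kindOf(wchar_t)        { return "wchar_t"; }
std::string kindOf(unsigned short) { return "unsigned short"; }
std::string kindOf(int)            { return "int"; }
std::string kindOf(unsigned int)   { return "unsigned int"; }
std::string kindOf(long)           { return "long"; }
```

Calling kindOf(L'x') selects the wchar_t overload, and kindOf(42) selects the int overload.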

The functionality of a variable is very unlikely to change. When the code changes enough for a variable’s functionality to change, the variable is usually renamed because it doesn’t feel right to have a variable explicitly say one thing and do another – explicitly, not in some kind of code that you have to decipher. Use that and you’ll be a lot safer.

About rlc

Software Analyst in embedded systems and C++, C and VHDL developer, I specialize in security, communications protocols and time synchronization, and am interested in concurrency, generic meta-programming and functional programming and their practical applications. I take a pragmatic approach to project management, focusing on the management of risk and scope. I have over two decades of experience as a software professional and a background in science.

4 Responses to Name For Functionality, Not Type

  1. I tend to agree with you, but I think this sentence deserves more explanation: “You need a style that doesn’t let your code lie to you – and Hungarian notation doesn’t qualify.” I believe you say Hungarian notation “doesn’t qualify” because it’s not readable enough, meaning that someone will easily disregard the prefix and not notice his mistakes. While I agree about the poor readability of one or two letter prefixes, I’ll say that it’s way better than no indication at all because at least with them someone familiar with the notation can spot errors by looking at the code line by line.

    The core of the problem with App Hungarian, or short prefixes or suffixes in general, is that they aren’t intuitive and thus easily get disregarded. Any variable names can lie, but some make the lies more obvious than others.

    Also, using “us” or even “unsafe” as a prefix isn’t very useful because it can mean anything (unsafe for what?). A better notation in my opinion would be to use an “html” prefix or suffix for html-formatted names.

    If you really want safety, you could design a special type that the compiler won’t let you mix with regular strings. That special type approach isn’t practical in every situation though; I mean what if I multiply a variable of type “width” with one of type “height”, should I get something of type “area”? How many types are you going to have in a big program then? How many interactions to define between these types?

    Those cases may be better served by a notation. As long as everyone who touches the code understands the notation, any notation can do. But obviously, the less obvious the notation the greater the risk of forgetting about it and disregarding it.

    • I tend to agree with you, but I think this sentence deserves more explanation: “You need a style that doesn’t let your code lie to you – and Hungarian notation doesn’t qualify.” I believe you say Hungarian notation “doesn’t qualify” because it’s not readable enough, meaning that someone will easily disregard the prefix and not notice his mistakes. While I agree about the poor readability of one or two letter prefixes, I’ll say that it’s way better than no indication at all because at least with them someone familiar with the notation can spot errors by looking at the code line by line.

      The reason why Hungarian notation makes your code lie is not just in the shortness of the prefix, it’s because the prefix is intended to tell you something about the variable that should be evident from its functionality (which should be in the name) but actually tells you something else. It gives you a semantic piece of information that is either not necessary or not true.

      The core of the problem with App Hungarian, or short prefixes or suffixes in general, is that they aren’t intuitive and thus easily get disregarded. Any variable names can lie, but some make the lies more obvious than others.

      I agree that any variable name can lie. The reason why I think one should name the variable for its functionality – rather than for its type – is because the functionality of any given variable is far less likely to change during the life-cycle of the code than the type is. Names should also be in plain English, so the lie becomes apparent when it’s there – which increases the threshold to changing the functionality of a variable and decreases the threshold to changing its name when you do.

      Also, using “us” or even “unsafe” as a prefix isn’t very useful because it can mean anything (unsafe for what?). A better notation in my opinion would be to use an “html” prefix or suffix for html-formatted names.

      You should take that up with Joel – he’s the one that came up with “us”. I personally wouldn’t advocate either us, s or html.

      If you really want safety, you could design a special type that the compiler won’t let you mix with regular strings. That special type approach isn’t practical in every situation though; I mean what if I multiply a variable of type “width” with one of type “height”, should I get something of type “area”? How many types are you going to have in a big program then? How many interactions to define between these types?

      Those cases may be better served by a notation. As long as everyone who touches the code understands the notation, any notation can do. But obviously, the less obvious the notation the greater the risk of forgetting about it and disregarding it.

      I agree that there are practical problems if you want the compiler to help you out all the way – especially if you don’t necessarily have a compiler that can do that. That is not what I am advocating, however: the topic being style and notation, I advocate a notational style that names for functionality rather than type. The type system of the language is irrelevant for this argument (though it does help to strengthen the point).

  2. Paercebal says:

    I do use Apps Hungarian notation for my code.

    Shame on me…

    This comes from my past as a JavaScript coder, and it shows in the markings I use (b for boolean, str for string, a for any kind of container, etc.). It’s now part of my parsing process…

    One thing bothers me a lot: not being able to see at first glance what kind of thing a symbol is.

    So, of course, if I see “strName” and I discover later it was an integer index, one could say the code lied to me, and this is all my fault.

    On the other hand, the following function:

    int add(int lhs, int rhs)
    {
       return lhs - rhs ;
    }
    

    does lie to me, too (this is my favorite example to explain that operator overloading in C++ is not as evil as some “C-lovers” want to believe it is). Should I verify the implementation of every add function I see in my code?

    I don’t think so.

    Code will always be able to lie, but the price for not trusting it is the time lost verifying everything.

    I trust the code I wrote, and I trust myself to not create an integer called strName. Now, strName could be a std::string, a std::wstring, a QString or even (Aaargh!) a char *, the thing is, strName contains a string about some name.

    Now the problem is: what about others’ code?

    “L’enfer, c’est les autres” said someone famous…

    • rlc says:

      Hi Raoul,

      I just came back to this post and noticed I hadn’t replied to your comment…

      You raise three interesting points:
      1. If your code lies to you, it’s entirely your own fault (and you trust the code you wrote);
      2. You have your own set of prefixes that come from your personal experience;
      3. You want to be able to see, at a glance, what kind of thing a symbol is.

      I also trust the code I wrote, but I live in a world where I am not the author of the majority of the code – or, statistically, even a significant part of it (although I do find my code significant, of course). So if code lies to me, misleads me or otherwise leads me astray, it’s usually not my fault because I didn’t write the code – I just have to live with it.

      One recent post on this blog, about API changes, illustrates this nicely, I think: changing an API to ignore a parameter to a well-known function (so well-known that you don’t usually look it up to see how it works) makes code lie – especially if the declaration of the function doesn’t indicate the change either. The kind of lie you illustrate in your comment is that kind of lie: not a change in type, but a change in function. No naming convention will fix that – but good manners will.

      Your second point all but sinks the Hungarian ship: people tend to use their own prefixes, based on personal experience. Sets of prefixes used differ — often subtly — from person to person. Your ‘str’ prefix is another man’s ‘s’ and your ‘a’ for “container” means “array” to someone else.

      What does this code do:

      #include "sy.h"
      extern int *rgwDic;
      extern int bsyMac;
      struct SY *PsySz(char sz[])
         {
         char *pch;
         int cch;
         struct SY *psy, *PsyCreate();
          int *pbsy;
          int cwSz;
          unsigned wHash=0;
          pch=sz;
          while (*pch!=0)
             wHash=(wHash<>11+*pch++;
          cch=pch-sz;
          pbsy=&rgbsyHash[(wHash&077777)%cwHash];
          for (; *pbsy!=0; pbsy = &psy->bsyNext)
             {
             char *szSy;
             szSy= (psy=(struct SY*)&rgwDic[*pbsy])->sz;
             pch=sz;
             while (*pch==*szSy++)
                {
                if (*pch++==0)
                   return (psy);
                }
             }
          cwSz=0;
          if (cch>=2)
             cwSz=(cch-2)/sizeof(int)+1;
          *pbsy=(int *)(psy=PsyCreate(cwSY+cwSz))-rgwDic;
          Zero((int *)psy,cwSY);
          bltbyte(sz, psy->sz, cch+1);
          return(psy);
          }

      That code was from the original article describing the notation.

      Which leads me to the third point: knowing what kind of thing a symbol is at a glance. I agree that that should be possible, but before I start on the “how”, let me say something about “what” first.

      Webster’s dictionary defines “kind” as the “fundamental nature or quality — essence”. In C++ that means I would want to know whether something is
      a. a type;
      b. a variable;
      c. a function.

      In the case of a variable, I would usually also want to know:
      a. its role (what it’s for);
      b. its scope;
      c. its type (which should be evident from its role).

      In the case of a function, I would want to know what it’s for – which should tell me what it does with sufficient precision to not have to look at its implementation. In the case of a type, I would want to know what it’s for – which should tell me enough to be able to use it. Structural types are a bit of an exception: LARGE_INTEGER is a good example, as its role changes from one use to another, but functional types, which constitute the majority of the types we use, should tell me what their use is (e.g. CertificateStore, Listener, Connection, etc.).

      The naming convention should tell me what kind of thing a symbol is, not what type.
