This started out as an essay for SATN but was too long and a bit rambling, so I decided to post it on this site instead for those who find it interesting. But before posting it I decided to upgrade this site so that the contents would be generated from XHTML files instead of directly, in order to allow more flexibility in changing the look and style. For SATN I use standard blogger tools so I don't have to take responsibility or spend the time maintaining it. For Frankston.com, however, part of the value is in what I learn from maintaining the site and experimenting with what is possible.
I've done a lot of writing since Connectivity 2002 but most of it is in the form of long and rambling thoughts that I hope to reduce to short essays that respect the reader's time. This essay falls in the middle. I hope it is interesting to read but, since it is longer and a bit off topic, I am keeping it out of the mainstream SATN site.
I've also been learning by doing. Or, to be honest about it, hacking. Hacking is a much-maligned word but it is a necessary word. While it generally involves writing software, the program itself is only a small part of what I'm trying to do. I looked up the word (hack) and was amused by "To write or refine computer programs skillfully". The real definition is there also: "To cut or chop with repeated and irregular blows: hacked down the saplings." The proper definition should be "To try to solve a problem with repeated and irregular blows". OK, there's a tad more finesse, but it is programming as an iterative process. Sure, some people have the silly idea that you should first design a program carefully and then implement it according to the detailed plan. The problem with that is that programming languages themselves are the best tools for exploring the design. For example, if you describe an object and its interfaces, the programming development tools help you keep track of all the information.
I think of programming as something like kneading dough rather than building a bridge. But this is dough that never hardens; you use it but continue to knead it. This places a strong emphasis on readability, in that I should be able to look at a section by itself and understand enough to make changes even years later. Cleverness for its own sake is dangerous. The goal is to find a way to express a solution that is obvious.
This same iterative process applies to explaining ideas and organizing my thoughts. To the extent that the ideas are obvious but new I have succeeded. In practice this is very difficult when writing to a wide audience since I can't rely on enough common understanding and have to build that understanding over time. I'm planning to publish some writing as "works in progress". These will appeal to those who already agree with much of what I say but it is still useful to explore the idea with that audience while trying to reach more people.
While such writing is a priority, if we just did the one most important thing nothing else would get done because most things aren't that important. But, together, they are far more important than any single high priority item. Unfortunately we don't know which of these little things are the real leverage points and, more likely, it will be some unexpected combination.
When we still used ribbons (what's a ribbon, Daddy? A ribbon is a long piece of cloth soaked in ink and wrapped around a spindle; a typing element pressed against it to deposit ink on paper) people would put a lot of effort into writing a memo and then print it using a faded ribbon on off-white paper. Somehow the act of using a fresh ribbon didn't seem that important. Yet without it, no one would consider the writing, no matter how good, as even worth looking at.
Sometimes little things count. Little things can be leverage points or just part of a general malaise or annoyance. It's hard to even get people interested in these issues but at least with SATN I can write about these annoyances that continue to fester. These particular annoyances bedevil programmers and those that depend on software.
The big annoyance is that I have so much to say and you have so little interest in reading long things. So I'll just list the points here and you can then read the long version if you are curious. As a plug for doing that I include some historic background about why these particular annoyances are what they are. They generally have strong historic reasons. For example the CRLF was a clever way to give the teletype enough time to go to the next line. (What's CRLF, a teletype? See the long version).
That wasn't too bad, but I really have a lot more to say on this stuff and if you're interested you can read on, and learn about the legacy of the typewriter and how all ideas seem to have passed through Multics.
There is no sharp distinction between minor annoyances and major issues. The bigger distinction is between those issues I can easily explain even if people don't care and more complex issues such as the conflict between the computer as a static appliance and the requirements of dynamic systems. But that is a big topic for another essay.
In the meantime, you can continue reading my more detailed comments on these issues.
I've written about this in Essay on Leap Seconds. In terms of computer science the problem with leap seconds is that a minute is no longer a minute. It might be 59 seconds or 61 seconds, but we don't know unless we know the context. Yet the libraries that compute time spans don't support context. And, as I point out, even context is not sufficient: leap seconds are discovered by inspecting the Earth's rotation and are thus not predictable.
Leap seconds are a good example of what happens when we ask the experts to do something. They tend to care too much and don't realize that it doesn't matter to anyone else. Thanks to the needs of commerce (and the railroads in particular) we don't really care what time it is as much as we care about coordinating events. Clocks used to be set so that it was noon when the sun was directly overhead. But now the clocks show an arbitrary value whose main virtue is that it is the same value as nearby clocks in the same time zone. A time zone can be stretched so that the difference between "zone" time and "sun" time can be well over an hour. Some people (policy makers) understood that clock time is an artificial construct and came up with the concept of daylight savings time, which simply added another layer of naming so that humans could keep using the same moniker, such as seven o'clock, and find that the sun is already shining; twice a year the names are changed so that seven AM is moved to when the sun is shining.
The nice thing is that you don't need to figure this all out. All we need to know is that our train leaves the station at 7:45.
Given that the names we use can be hours different from the "real time", what idiot would try to make sure that such an artificial construct is encumbered with a correction that amounts to less than a millionth of a second? They got away with it because no one cared because it didn't matter.
Except to us programmers because we do take time seriously. We don't care when and where the sun shines but we do want our time arithmetic to work. We generally store time as the number of units, say seconds, from a given base time. If we add 60 seconds we assume that it is equivalent to adding one minute. We also have the need to think ahead and represent future times. Thus we can represent 2020 as the number of seconds since 1980.
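To make this convention concrete, here is a minimal sketch in Python (the epoch and function names are just illustrations, not any particular system's API). Note that the arithmetic works precisely because it pretends leap seconds don't exist:

```python
from datetime import datetime, timedelta

# Represent a moment as the number of seconds since an arbitrary base time.
EPOCH = datetime(1980, 1, 1)

def to_seconds(moment):
    """Seconds since the epoch, assuming every minute has exactly 60 seconds."""
    return int((moment - EPOCH).total_seconds())

def add_minutes(seconds_since_epoch, minutes):
    """Adding a minute is assumed to be the same as adding 60 seconds."""
    return seconds_since_epoch + minutes * 60

# A time in 2020 expressed as seconds since 1980; because leap seconds are
# ignored, the round trip is exact and the arithmetic stays simple.
t = to_seconds(datetime(2020, 1, 1))
assert EPOCH + timedelta(seconds=t) == datetime(2020, 1, 1)
```

If the library did try to honor leap seconds, `add_minutes` would need a table of every leap second between the two moments, a table that doesn't exist for future dates.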
After all, we aren't asking precisely when the earth will pass a certain point in its orbit; we just want to make sure that if we agree to meet at a certain time we will both be there.
But "those who care too much" about time have foisted the leap second upon us. This is a correction factor that accounts for the slight distortions in the earth's rotation. It is an important consideration when writing software that tracks GPS satellites. That's fine.
Unfortunately such a factor means that every time calculation on every computer must take into account leap seconds to determine the correct number of seconds between two periods. And this is impossible!!! It's not just difficult but impossible. The difficult part is bad enough since all the time functions in standard programming languages simply don't accommodate such a useless concept. There is simply no value in doing so and a lot of risk in introducing very complicated and subtle bugs. Worse, it means that programs that incorporate the correction factor and those that don't will not come up with the same time values. So better that no program does it.
But this means that the computer time and the official standard time are inherently out of alignment. For no purpose other than the pettiest inconsistency. If you do use a time source to correct the time on your computer, periodically the interval calculation will be wrong. And pity the poor program that runs at sixty seconds after midnight because it doesn't even have the concept.
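Python's standard library is typical of the problem: its datetime type simply has no room for a 61st second, so a real leap second such as 2016-12-31 23:59:60 cannot even be represented:

```python
from datetime import datetime

# 2016-12-31 23:59:60 UTC was an actual leap second, but the standard
# datetime type only allows seconds 0 through 59.
try:
    leap = datetime(2016, 12, 31, 23, 59, 60)
except ValueError:
    leap = None  # the library rejects the leap second outright

assert leap is None
```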
The reason it's impossible to deal with this is that knowing how to account for the leap second requires knowing the context of the time. We can't simply represent the number of seconds in a week because it matters which week, and we typically don't know which week when we are working with a generic week.
OK, all of this is bad enough but it gets worse (or better for those easily amused). We cannot anticipate leap seconds because they depend on observing how the earth wobbles. A dam like the Three Gorges in China has a direct effect on the wobble and thus on leap seconds.
Can someone please tell the ITU or whoever sets time standards to stop this nonsense? Keep the correction factor for when it's needed. It's also OK to show off at parties by telling us about leap seconds. Just don't muck up our computers and infrastructure with the flawed concept.
Yes, I know leap years have similar issues but, unlike leap seconds, the leap year adjustments are big and thus we are very aware of them. Leap seconds are more like germs that make us sick but that we can't really deal with directly.
Notice that this leap second issue is a form of the naming problem we see with the ".com" DNS. The real problem is illiteracy. Naming, representation and ambiguity are fundamental concepts that are simply not taught. In fact, few teachers are probably even capable of dealing with it. Instead we are forced into flawed physical analogies that seem to be fine as metaphors but fail when we actually have to do something real.
Zip Codes, Area Codes and all that.
I've also written a little about this issue so I won't belabor it beyond wishing that we stop this nonsense of splitting area codes (phone numbering prefixes) and recognize that the phone number is used as a stable handle rather than a transient identifier. We see this problem even more dramatically with postal codes; it takes years to recover after the post office changes one. These are just more examples of confusing names and things. Email addresses and .COM names are other examples. But after saying so much about leap seconds, let's just move on.
I'll try to avoid getting into the naming problem again, though the concept of "end of line" can be treated abstractly. CRLF stands for Carriage Return/Line Feed. In the old days we had typewriters (Daddy, what is a typewriter?). And typewriters had platens, which were cylinders that held the paper while we smashed the ribbons (I already told you what a ribbon is) against them. The typewriter was a big metal object and the platen moved while the type bars stayed in place, sort of like keeping your mouth in place as you eat kernels off an ear of corn. You slammed the carriage return lever and the carriage (which held the platen) returned so you could type the next line.
Teletypes were automatic typewriters, though they came to move a small type head instead of moving the platen. But we still had to return the type head to the starting position (on the left for English). The control characters (codes below 32 in the standard character sets) were used for these kinds of functions. The carriage return is number 015 in octal (no one ever says 13, though some people seem to use this newfangled hexadecimal and say 0D). The problem is that it takes a long time to travel from the right end to the left edge of the page. When typing at a high speed like ten characters per second it could take two character-times (the unit of measure in such a system) to do this.
We also had to move to the next line, hence the linefeed character. On a typewriter slamming the platen over also engaged the mechanism to advance the paper but on the teletype the functions were separate. While this allowed for overtyping, the real value was in creating a two character pair so that returning the carriage (OK, the type head) would be followed by a linefeed. It had to be in that order so we could start sending the type head to the left and then overlap it with the quicker operation of feeding the paper. Thus the CRLF pair became the standard line separator.
For programmers, however, such a pair was an annoyance. Multics (http://www.multicians.org) introduced the notion of a new line character (NL). The challenge was in maintaining compatibility with other systems that maintained this teletype model. Thus the LF was repurposed as the NL. This meant that CRLF and NL would be equivalent, since it didn't matter how many times you sent a CR; you would still be in the same position. But each LF moved you to a new line. This simplified programming since I just had to deal with a single character. Unix, Multics' cute offspring, retained this as the familiar \n (for those who program in C and many other languages). (The \ was also from Multics.)
But some glommed onto the RETURN (CR) character as the end of line and record delimiter. The Macintosh followed this path. This means we have to worry about the two cultures when we exchange files. The problem is that we must know whether the data is text (where characters have to be interpreted according to their function) or binary (where bits are just bits and an octet (AKA byte, AKA character) is just a number). And most file systems don't have a way to say this. At best we can infer it. Even with the introduction of Unicode we have to use clever tricks to guess the encoding and, for those of you visiting Asian web pages, the guesses are often very wrong.
When Digital Research created CP/M they chose to emulate the teletype itself and went back to the whole CRLF sequence, and Microsoft followed this convention with DOS. Thus the pervasive problem of two-character end of line sequences. But with the growth of networks we have to deal with files from all sorts of communities and worry about mixed conventions.
The result is that when dealing with email messages and other files across cultures I have to cruft my code and worry about CRLF vs NL vs CR and combinations of these that might all exist in a single file.
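As a sketch of what that cruft looks like, here is one way to normalize the mess in Python: splitlines() happens to recognize CRLF, bare CR, and bare LF, and treats a CRLF pair as a single break, so all three cultures (and mixtures) collapse to one form:

```python
def normalize_newlines(text):
    """Rewrite any mix of CRLF, CR and LF line endings as plain LF.

    splitlines() treats \r\n as one break and also splits on a bare \r
    or \n, so mixed conventions within a single file are handled too.
    """
    return "\n".join(text.splitlines())

mixed = "unix line\nmac line\rdos line\r\nlast line"
assert normalize_newlines(mixed) == "unix line\nmac line\ndos line\nlast line"
```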
The one piece of good news is that there hasn't been a strong effort to make email more efficient, and thus we still use text formats for email messages and encode binary data in ASCII. But when I see a "content-length" header I still wonder what assumptions are being made about the end of line characters and whether some intermediate system has converted from one convention to another and thus made the count wrong.
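The worry is easy to demonstrate: the same message has a different byte count under the two conventions, so a count taken before a conversion is wrong afterwards (the address below is made up for illustration):

```python
text_lines = ["From: someone@example.com", "", "Hello."]

lf_body = "\n".join(text_lines)      # Unix convention: one byte per break
crlf_body = "\r\n".join(text_lines)  # DOS/Internet convention: two bytes

# Two line breaks, one extra byte each; any stored content length
# computed under one convention is off by this amount under the other.
assert len(crlf_body) - len(lf_body) == 2
```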
TAB vs Whitespace
Closely related is the concept of white space and the tab character. The tab key was used on typewriters for preparing tables (tabulation). Each tab stop was implemented as a piece of metal raised a little. When you pressed the TAB key the carriage would advance to the next stop. This worked well as long as the final result was ink that dried in place on the paper. But once we had teletypes the tab became ambiguous. We had to make sure everyone involved agreed on where the tab stops were so the results would look the same. But, as with leap seconds, doing so required context; we had to know where we started in order to know where we would wind up.
Remember that teletypes started out as remote typewriters and the tab you typed had to be sent as such, both for mechanical reasons and to preserve timing (as long as you didn't set the tab stop too far away). Unfortunately we kept the tab character when we used computers because it made it easy to line up statements and indent. The early text editors were very simple and, as with teletypes, just accepted our typing literally. Of course there would be controversies about the "right" tab settings. 10 was common, but we couldn't agree on decimal (10) or octal (8). On a narrow page we'd halve this to 5 or 4.
To this day when we look at a text file or program from a different tab culture we get gratuitous uglification. For no purpose whatsoever. It makes so much more sense to just use space characters in text files and not have this nonsense at all. Maybe thirty years ago we couldn't afford the extra space it took, but even then it was silly; if it was that important the file could be compressed automatically and unambiguously. So, please, all of you who write programming tools and those of you who write programs, please stop this nonsense and ban the tab!
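The ambiguity is easy to show: the very same stored line renders differently depending on which tab stops the reader happens to assume. A quick Python illustration:

```python
line = "ab\tvalue"

# The same bytes, expanded under two different tab-stop conventions,
# produce two different layouts.
assert line.expandtabs(8) == "ab      value"  # 6 spaces: next stop at column 8
assert line.expandtabs(4) == "ab  value"      # 2 spaces: next stop at column 4
```

A file written with spaces instead would look identical everywhere, which is the whole point.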
The tab has two other problems. One is the whitespace problem. Multics took the reasonable approach of treating text operations according to how humans would see the text. Thus the concept of whitespace, which treated blank space the same whether it was due to tabs or spaces. If you typed a tab and then backspaced, it would automatically be converted to space characters. In fact, if you typed a tab, went back and inserted a character, and then spaced past what you had typed, it would all be put in a standard order. Remember that we had printing terminals, basically typewriters, and moving over a printed character would not erase it. Yet we still have silly things like tab-delimited files that depend on somehow knowing which empty space is which.
As bad as tabs are in programming, the tab key is a real problem in word processors since it means that the way a document looks on one screen or page will not be the same as on another if there is the slightest change in the environment. It's even more confusing than the 8/10 controversy because the starting position depends on all of the accidental properties of text layout. Even if we set a tab stop we don't know where we are starting from. Perhaps I should use viral marketing (remember when that term sounded good?) to override the tab function and pop up a big "don't do that; learn how to use tables and other mechanisms" message. Too bad so many "word processors" (a term once used for people who process words or, to be exact, process text, since writers process words) don't even have the concept that there is an abstraction here.
HTML was a major advance in that it simply didn't have the concept of tabs. Whitespace was whitespace and it didn't matter if you used tab, space or even newline. It was all the same. Too bad so much work has been done to bring back all of the accidental properties of paper and the legacy of ribbons impregnated with berry juice. HTML is wonderfully ambiguous but too many people confuse the artifact of a document with the document itself. Maybe in another generation.
I like HTML because it is not beholden to paper. But PDF files were created for those who want to force us to pan around pieces of paper on our screens. I understand the value of having a way to "print" a document in such a way that we don't need to actually squash berry juice. It is a useful tool to quickly take legacy documents and do a bit better than imaging them to bits. At least PDF files allow the imaging of text to our screen in whatever way looks good locally. PDF does offer much more control over presentation than HTML and is a valuable way to repurpose existing documents so they can be made available electronically for online viewing (even if awkward) and for local printing.
The problem is that too many people seem to be stuck on trying to control how things look on my screen. Usually it is best to let the information (such as a table) be adjusted for local viewing rather than being stuck in a fixed size. Unfortunately people have been slow in adopting newer standards that allow for more control over presentation (including SVG, which Adobe has championed). One reason is, perhaps, that those who are most concerned with retaining full control are still thinking in terms of paper and thus are satisfied with PDF rather than rethinking their presentations for the new medium.
The result is that I have to deal with all of the limitations of paper rather than having the information itself. PDF is a nice crutch; too bad some people confuse it with an effective tool. While I'd rather have PDF than paper, I would much prefer that people faced up to writing documents for screen viewing.
Note that PDF is based on PostScript, which seemed like a great idea at first. In fact you can see concepts like forms and links in PDF, but there is simply no point in investing effort into a secondary technology when HTML already dominates that niche.
While on the subject of PDF I also lament Flash. Flash can be a very powerful tool but all too often it is used to create a canned experience: television rather than interaction. When I encounter a Flash web site I expect the worst. Instead of giving me information, someone is attempting to entertain me and keep me ignorant. This is not intrinsic to Flash, but there is a tendency for Flash to attract those trained in "multi-media image making" rather than sharing. Just as I'd rather read than be forced to listen to a canned presentation, I'd rather get information than be "treated" to an "experience". But this is one more literacy issue. I remember trying to explain that email was an information medium rather than just a way of sending business letters more quickly. Too many classes on web design focus on the glitz rather than the depth. This will change, but it will require that those doing the presentations learn more about the technology and not just the "medium".
Back to character sets. Multics was the first system to revel in using the full upper/lower case character set. It seemed like a luxury on the previous generation of computers, which were seen as big calculators. On Multics we could finally write readable email. Multics' predecessor, CTSS, used typewriter terminals that gave us a taste of this. In fact, Jerry Saltzer wrote Runoff on CTSS and HTML is a direct descendant of that effort. The terms, by the way, come from typesetting, where the metal slugs used for the characters were kept in the upper and lower cases (drawers).
The problem is that while humans find case distinctions very useful and often important, the upper/lower case distinction is not considered significant for deciding whether two things are the same. In English we've settled on the use of upper case for indicating the beginning of a sentence and for names. It also used to be used for important words. But since block writing is often done in upper case, Bob and BOB are considered to be the same name in practice.
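In code, the usual accommodation is to fold case before comparing, treating case as presentation rather than identity. A minimal sketch in Python (the helper name is mine):

```python
def same_name(a, b):
    """Compare names the way people do: case is display, not identity.

    casefold() is an aggressive form of lower() intended for caseless
    matching, so it also handles non-English cases correctly.
    """
    return a.casefold() == b.casefold()

assert same_name("Bob", "BOB")
assert same_name("bob", "Bob")
assert not same_name("Bob", "Rob")
```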
At one point I had my office manager purchase labels so we could track equipment. Since we had both our own and borrowed equipment I ordered black permanent labels for what we owned and red removable ones for what we didn't own. It never occurred to me that she would use the same numbers for both sets. It was obvious to me that the color was not important in distinguishing entries in the database.
As an aside, back to those inked ribbons: typewriters used to have both red and black ribbons, but they disappeared in the 1960s just as Xerox copiers were coming on the scene. Copies didn't preserve the color distinction, so accountants had to use () around a number to indicate a negative number rather than red ink. The idea of color distinction isn't really wrong. It's just that in practice we don't preserve that property, because it is simpler to have a numeric (or text) key without arbitrary qualifiers and because humans don't normally say "red 3" (except in roulette).
In programming I've been using Visual Basic. BASIC as a language predates case distinctions. I can use upper and lower case and don't have to worry about it. In fact, if I define a variable with an upper case letter, the parser will make all my references conform, which is helpful feedback.
It's amazing how much of this goes back to Multics; it just happened to be a coalescing point in the late 1960's. The ideas didn't necessarily originate there but it is where I learned about them. One idea was the archive file. You could use an archive to store a group of files efficiently. Thinking back, however, the archive was a bad idea since we should've just used directories and had the system provide a way to compress directories for efficiency. The archive was like a directory, but since it was implemented as an application it was easy to add features, including the ability to update backup copies in the archive (though I remember losing work because, unlike the rest of Multics, it didn't support daylight savings time changes).
The archive file lived on in Unix as the TAR file. The TAR file had an additional use in that it turned a tree structure into a stream of bytes that could be piped around. Unix pipes (borrowed from the Multics design documents) were useful for moving bits and information around. With a tar file I could just pass a whole set of files through a pipe and take them apart at the other end.
The ZIP file is essentially a compressed archive (why ZIP became the popular format is another story that I won't go into here). It has two major purposes. One is compression and the other is to group a set of files under one name so they can be handled as a unit.
Historically ZIP files made a lot of sense since they were implemented as applications and thus allowed the users to solve their own problems. But architecturally it is a very bad idea, and over time, as the capabilities were incorporated into the file system, ZIP files, which were useful though inconvenient, became simply annoying.
Even before we had compression built into the file system there were some implementations of ZIP files that made them act like extensions to the file system. This was a very good idea (even if the implementations were far from perfect). In fact, with XP one can treat the ZIP file as sort of a directory but only some operations work. Better to just use the directory itself. After all, with a compressing file system we gain little except, perhaps, for breakage. Breakage is a technical term for the space lost when a file doesn't quite fill out a block on disk. But this is easily handled within the file system.
The real problem with ZIP files, however, is when they are used as download units. Here too the idea of grouping files together for transfer isn't bad. The problem is that it is very visible and unnecessary. I should be able to treat a group of files as a unit like a directory. It wouldn't take too much to make this completely invisible. Given that a ZIP file does act like a quasi-directory under XP it shouldn't be too hard to make them completely transparent.
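The quasi-directory behavior isn't hard to imagine; Python's standard zipfile module, for instance, already lets a program list and read the members of an archive as if they were files in a folder, with no manual unpacking (the file names here are illustrative):

```python
import io
import zipfile

# Build a small archive in memory, then treat it as a read-only
# directory: list its members and read one by name, no unpacking to disk.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("docs/readme.txt", "hello")
    archive.writestr("docs/notes.txt", "world")

with zipfile.ZipFile(buffer) as archive:
    names = archive.namelist()
    text = archive.read("docs/readme.txt").decode("ascii")

assert names == ["docs/readme.txt", "docs/notes.txt"]
assert text == "hello"
```

Wiring that view into the shell and file transfer tools is all it would take to make the archive disappear as a user-visible concept.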
What is annoying is that ZIP files are used for compression for downloading and require that I manually unpack them for use. V.42bis modems already do compression, and for higher speeds compression is less of an issue but can still be done by cooperating file transfer programs. What is especially annoying is when people compress an image file such as JPEG which is already compressed. Another related problem is downloaded EXE's which are really nothing more than self-expanding compressed files like ZIP. (Note that CAB files are simply a variation on the same theme.) Microsoft seems to like to take a DOC file and package it as an EXE for downloading. Not only does this mean I have to waste my time dealing with this, I also have to trust them enough to run a program with arbitrary power when I just want to look at a document. I also have to do things like respond to dialogs about where to unpack the files and how to dispose of the EXE.
Let's just stop this nonsense. Rather than putting effort into better support for this kludge, the goal should be to make it transparent. It would be very useful to extend the file transfer programs to support directories and other groups. This can be a negotiated extension so that one can still use old transfer programs. But removing unnecessary manual intervention would be a significant step towards decreasing pervasive annoyance.
Drive letters and all that.
Enough for now. Next time I'll write about drive letters and other things that make it unnecessarily annoying to try to really use PCs in a dynamic environment. I'm also annoyed about AM/PM time. I remember once waking up at 8:00 and rushing off to work until I realized it was really 8 PM and I had slept only four and not twelve hours. I also need to write about email and the concept of identity and how it relates to spam.