New Base 64

New Base 64

A base 64 numbering system using only print-safe and URL-safe ASCII numbers and letters.

a side effect of creating a variable precision short geo URL encoding scheme.

Tantek Çelik, 2010-075

introduction

I wanted to create a variable precision encoding for geo coordinates (latitude and longitude) in order to represent them via short URLs, and determined that octal coordinates would compress pair by pair into base 64 character by character (OctalGeo). Given that I had already developed a print/URL/ID safe base 60 numbering system from ASCII characters (NewBase60), it seemed logical to start with that to see if it was possible to extend it to base 64, while minimally relaxing the original design constraints and maintaining print/URL safety at a minimum.

background

See first the "URL shortening" and "Methodology" sections of NewBase60, as those both apply here.

Since NewBase60 used up all the characters that satisfied the "Identifier compatibility" design methodology criteria, specifically "HTML ID and NAME tokens", this constraint must be relaxed in order to find four more characters for encoding the decimal numbers 60, 61, 62, 63.

The drivers of the "Identifier compatibility" constraint were URLs, HTML ID attributes, and identifiers in computer languages. The latter two lack the necessary four remaining characters, and thus we drop them.

methodology

NewBase60 reuse - reuse symbols used from NewBase60 numerical notation for the same meaning.
ASCII

Use stricly ASCII7 characters - this provides maximum robustness for transmission of content to mobile devices, applications etc.
Use ASCII ordering. If a character is latter in the ASCII order, it should designate a higher value. If additional characters are earlier in ASCII order than existing NewBase60 characters, then relax this constraint completely in deference to the NewBase60 constraint.

Identifier compatibility

URL characters not requiring encoding per http://www.ietf.org/rfc/rfc1738.txt

$ - _ . + ! * ' ( ) , from Section 2.2 of RFC 1738

URL characters that are unambiguous

Unambiguous readability. Skip latter characters that are often rendered the same or nearly the same in common small fonts (e.g. in monospace text editors or mobile phone displays) as earlier characters. Enable roundtripping from visual media (i.e. from print output - see ShortURLPrintExample - to manual human or OCR input).
Consider use contexts. As NewBase64 numbers may/will be used in prose or perhaps data structure literal constructs, the contexts of these languages/environments must be considered when judging/evaluating the feasibility of using any specific characters.
Reserve other functionality. Reserve characters that should or will likely serve other functionality.
Reasonable ordering mnemonic. If lacking the ASCII ordering constraint, develop a reasonable alternative for ordering extra characters, preferably one with a good mnemonic.

derivation

Starting with NewBase60 compatibility, values 0-59 are represented by [0-9A-HJ-NP-Z_a-km-z]
look at additional URL safe characters that do not requiring encoding
use '$'
use '-'
skip '.' because used for decimal values, and sentence terminator (thus when NewBase64 is embedded in prose we won't accidentally include the period at the end of a sentence).
use '+'
collapse '!' with 1 (I,l) per unambiguous human readability
use '*'
skip single-quote (') per use context likelihood of quoted values.
reserve '(' and ')' for grouping functionality
reserve ',' for coordinates or list functionality
because all of the "use" characters above are earlier in ASCII order than NewBase60 "0", relax the "Use ASCII ordering" constraint accordingly to maintain 0-59 equivalent mapping with NewBase60.
Use reverse ASCII ordering for the four new "use" characters, which are also in approximate shape/stroke order: - + * $

Another mnemonic is that $ means end of string in regex, and thus makes a good "last" symbol in the numbering system.

New Base 64 numbering system

The complete numerical sexagesimal sequence grouped:

[0-9A-HJ-NP-Z_a-km-z\-+*$]

The complete sequence written out in 8 rows of 8 for readability:

   0 1 2 3 4 5 6 7
 0 0 1 2 3 4 5 6 7
 8 8 9 A B C D E F 
16 G H J K L M N P 
24 Q R S T U V W X 
32 Y Z _ a b c d e
40 f g h i j k m n

48 o p q r s t u v 
56 w x y z - + * $

open source implementation

The following CASSIS V0 code (runs in PHP and javascript with the inclusion of cassis0.js from the CassisProject) is licensed under a Creative Commons Attribution Share-Alike license (for now). Please attribute to "Tantek Çelik" and link attribution to http://tantek.com

Cassis V0 code

runs in both PHP and javascript

function numtob64($n) {
 $s = "";
 $m = "0123456789ABCDEFGHJKLMNPQRSTUVWXYZ_abcdefghijkmnopqrstuvwxyz-+*$";
 if ($n===undefined || $n===0) { return 0; }
 while ($n>0) {
   $d = $n % 64;
   $s = strcat($m[$d],$s);
   $n = ($n-$d)/64;
 }
 return $s;
}

function numtob64f($n, $f) {
 $s = numtosxg($n);
 if ($f===undefined) { 
   $f=1; 
 }
 $f -= strlen($s);
 while ($f > 0) { 
   $s = strcat("0",$s); 
   --$f; 
 }
 return $s;
}

function b64tonum($s) {
 $n = 0;
 $j = strlen($s);
 for ($i=0;$i<$j;$i++) { // iterate from first to last char of $s
   $c = ord($s[$i]); //  put current ASCII of char into $c  
   if ($c>=48 && $c<=57) { $c=$c-48; }
   else if ($c>=65 && $c<=72) { $c-=55; }
   else if ($c==73 || $c==108 || $c==33) { $c=1; } // typo capital I, lowercase l, ! to 1
   else if ($c>=74 && $c<=78) { $c-=56; }
   else if ($c==79) { $c=0; } // error correct typo capital O to 0
   else if ($c>=80 && $c<=90) { $c-=57; }
   else if ($c==95) { $c=34; } // underscore
   else if ($c>=97 && $c<=107) { $c-=62; }
   else if ($c>=109 && $c<=122) { $c-=63; }
   else if ($c==45) { $c=60; } // dash
   else if ($c==44) { $c=61; } // plus
   else if ($c==42) { $c=62; } // asterisk
   else if ($c==36) { $c=63; } // dollar sign
   else { $c = 0; } // treat all other noise as 0    
   $n = 64*$n + $c;  
  }  
  return $n; 
 }

history

2010-014

Met Justin Kruger and learned about his loc.is service from Kevin Marks. As far as I know this is the first instance and idea of a geo short URL which has character by character increasing precision (albeit only after the first 2 characters). Seemed like a really good idea (geo short URL, increasing precision character by character), but certainly appeared to have some (addressable) weaknesses:

First two characters treated "specially" (Nautical time zone, longitudinal region) rather than as indicators of an index into a grid.
Starting unit of area precision is a rectangle (rather than a square).
Character encoding uses non-print-safe characters, e.g. loc.is/1S4Ql (trailing lowercase L)

While the character-by-character increasing 2D precision is certainly a clever idea, and the loc.is implementation is quite nice, I was convinced that the above mentioned weaknesses could/should be addressed. A NewBase64 (as described on this page) solves weakness #3. My thoughts for a fully OctalGeo which uses an initial 32 square (4x8) grid with further precision specified in 1/64 squares to replace latitude and longitude solves weaknesses #1 and #2.

2010-071

On a flight SWA396 from DEN gate C30 to AUS, I finally brainstormed a specific description of the idea of a new Base64 for use in URLs but not ID attributes that would form the basis of an OctalGeo solution. Raw notes:

Base64 for URLs only (not id attributes)

For use in mathematical/2d encoding scenarios (eg geo coordinates 6 bits per digit of 2d precision = 3bits lat and long)

Need 4 more chars in addition to base60

Additional chars to consider: +-*.,^=()[]{}@:;~

Prefer less ambiguous chars.

Reserve chars that could serve other funcs:

+ means space in URLs, avoid using whitespace to encode

:; avoid using : and ; to permit latter reuse for property:value; syntax

@ for @ rules (like in CSS) and twitter handles

{} for code/declaration blocks

, for coordinates, lists, sets

?=& reserved for URL query params

| collapse to 1

! collapse to 1, also trailing sentence punctuation

. fails because as trailing is sentence punctuation - could reuse inside and as leading char for fractional values. Can also use as a terminator to indicate baseline (in case of --- or ____ numbers)

- is very reasonable, already part of many property names

^ seems workable, means "1" in sumerian, could use for 61

~ tilda is reasonable

$ is reasonable, already part of var names in js

That progression is nice as it is linear/monotonic with # of strokes: - ^ ~ $ for 60-63. small dash - after z also nicely parallels big underscore _ after Z.

Also, $ as the "last" char is nicely easy to remember as it means end/suffix in regexes.

This first attempt was close, but unfortunately the characters ^ and ~ are considered "unsafe" by RFC1738, "because gateways and other transport agents are known to sometimes modify such characters", and thus this is suboptimal for a URL-safe base 64. I discovered this when writing up my new base 64 thoughts (while online and able to check RFC1738), and chose other characters accordingly as described in the methodology section.

However, this initial on-the-flight brainstorm was quite close to what I eventually came up with, only needing to replace ^ and ~ with + and *, and maintaining the number of strokes progression.

prior work

There have been a few Base64 encodings proposed and put into use, however, none of them satisfied the above design constraints.

RFC 3548 Base 64 Encoding

http://tools.ietf.org/html/rfc3548#section-3

Problems:

values 0-15 do not match up with hexadecimal (nor obviously NewBase60)
several out-of-ASCII-order sequences are used.
not print-safe. uses I (capital i), O (capital o), l (lowercase L), and numbers 1 and 0 for different values.
not URL-safe. use of "/" character makes it URL-unsafe or at a minimum implying subdirectory / path-component semantics when none are present.

with URL Safe Alphabet

http://tools.ietf.org/html/rfc3548#section-4