A base 64 numbering system using only print-safe and URL-safe ASCII numbers and letters.
or
a side effect of creating a variable precision short geo URL encoding scheme.
Tantek Çelik, 2010-075
shortlink: http://ttk.me/w/NewBase64
I wanted to create a variable precision encoding for geo coordinates (latitude and longitude) in order to represent them via short URLs, and determined that octal coordinates would compress pair by pair into base 64 character by character (OctalGeo). Given that I had already developed a print/URL/ID safe base 60 numbering system from ASCII characters (NewBase60), it seemed logical to start with that to see if it was possible to extend it to base 64, while minimally relaxing the original design constraints and maintaining print/URL safety at a minimum.
See first the "URL shortening" and "Methodology" sections of NewBase60, as those both apply here.
Since NewBase60 used up all the characters that satisfied the "Identifier compatibility" design methodology criteria, specifically "HTML ID and NAME tokens", this constraint must be relaxed in order to find four more characters for encoding the decimal numbers 60, 61, 62, 63.
The drivers of the "Identifier compatibility" constraint were URLs, HTML ID attributes, and identifiers in computer languages. The latter two lack the necessary four remaining characters, and thus we drop them.
$ - _ . + ! * ' ( ) , from Section 2.2 of RFC 1738
The complete numerical sexagesimal sequence grouped:
The complete sequence written out in 8 rows of 8 for readability:
0 1 2 3 4 5 6 7 0 0 1 2 3 4 5 6 7 8 8 9 A B C D E F 16 G H J K L M N P 24 Q R S T U V W X 32 Y Z _ a b c d e 40 f g h i j k m n
48 o p q r s t u v 56 w x y z - + * $
The following CASSIS V0 code (runs in PHP and javascript with the inclusion of cassis0.js from the CassisProject) is licensed under a Creative Commons Attribution Share-Alike license (for now). Please attribute to "Tantek Çelik" and link attribution to http://tantek.com
function numtob64($n) { $s = ""; $m = "0123456789ABCDEFGHJKLMNPQRSTUVWXYZ_abcdefghijkmnopqrstuvwxyz-+*$"; if ($n===undefined || $n===0) { return 0; } while ($n>0) { $d = $n % 64; $s = strcat($m[$d],$s); $n = ($n-$d)/64; } return $s; } function numtob64f($n, $f) { $s = numtosxg($n); if ($f===undefined) { $f=1; } $f -= strlen($s); while ($f > 0) { $s = strcat("0",$s); --$f; } return $s; } function b64tonum($s) { $n = 0; $j = strlen($s); for ($i=0;$i<$j;$i++) { // iterate from first to last char of $s $c = ord($s[$i]); // put current ASCII of char into $c if ($c>=48 && $c<=57) { $c=$c-48; } else if ($c>=65 && $c<=72) { $c-=55; } else if ($c==73 || $c==108 || $c==33) { $c=1; } // typo capital I, lowercase l, ! to 1 else if ($c>=74 && $c<=78) { $c-=56; } else if ($c==79) { $c=0; } // error correct typo capital O to 0 else if ($c>=80 && $c<=90) { $c-=57; } else if ($c==95) { $c=34; } // underscore else if ($c>=97 && $c<=107) { $c-=62; } else if ($c>=109 && $c<=122) { $c-=63; } else if ($c==45) { $c=60; } // dash else if ($c==44) { $c=61; } // plus else if ($c==42) { $c=62; } // asterisk else if ($c==36) { $c=63; } // dollar sign else { $c = 0; } // treat all other noise as 0 $n = 64*$n + $c; } return $n; }
Met Justin Kruger and learned about his loc.is service from Kevin Marks. As far as I know this is the first instance and idea of a geo short URL which has character by character increasing precision (albeit only after the first 2 characters). Seemed like a really good idea (geo short URL, increasing precision character by character), but certainly appeared to have some (addressable) weaknesses:
While the character-by-character increasing 2D precision is certainly a clever idea, and the loc.is implementation is quite nice, I was convinced that the above mentioned weaknesses could/should be addressed. A NewBase64 (as described on this page) solves weakness #3. My thoughts for a fully OctalGeo which uses an initial 32 square (4x8) grid with further precision specified in 1/64 squares to replace latitude and longitude solves weaknesses #1 and #2.
On a flight SWA396 from DEN gate C30 to AUS, I finally brainstormed a specific description of the idea of a new Base64 for use in URLs but not ID attributes that would form the basis of an OctalGeo solution. Raw notes:
Base64 for URLs only (not id attributes)
For use in mathematical/2d encoding scenarios (eg geo coordinates 6 bits per digit of 2d precision = 3bits lat and long)
Need 4 more chars in addition to base60
Additional chars to consider: +-*.,^=()[]{}@:;~
Prefer less ambiguous chars.
Reserve chars that could serve other funcs:
+ means space in URLs, avoid using whitespace to encode
:; avoid using : and ; to permit latter reuse for property:value; syntax
@ for @ rules (like in CSS) and twitter handles
{} for code/declaration blocks
, for coordinates, lists, sets
?=& reserved for URL query params
| collapse to 1
! collapse to 1, also trailing sentence punctuation
. fails because as trailing is sentence punctuation - could reuse inside and as leading char for fractional values. Can also use as a terminator to indicate baseline (in case of --- or ____ numbers)
- is very reasonable, already part of many property names
^ seems workable, means "1" in sumerian, could use for 61
~ tilda is reasonable
$ is reasonable, already part of var names in js
That progression is nice as it is linear/monotonic with # of strokes: - ^ ~ $ for 60-63. small dash - after z also nicely parallels big underscore _ after Z.
Also, $ as the "last" char is nicely easy to remember as it means end/suffix in regexes.
This first attempt was close, but unfortunately the characters ^ and ~ are considered "unsafe" by RFC1738, "because gateways and other transport agents are known to sometimes modify such characters", and thus this is suboptimal for a URL-safe base 64. I discovered this when writing up my new base 64 thoughts (while online and able to check RFC1738), and chose other characters accordingly as described in the methodology section.
However, this initial on-the-flight brainstorm was quite close to what I eventually came up with, only needing to replace ^ and ~ with + and *, and maintaining the number of strokes progression.
There have been a few Base64 encodings proposed and put into use, however, none of them satisfied the above design constraints.
Problems:
Problems:
While this variant of the RFC 3548 Base 64 Encoding does solve the URL-safe problem (#4 above), by replacing the use of the "/" with "_", it still has the first three problems:
In short, both RFC 3548 Base 64 encoding variants have the primary problem of not being unambiguously print-safe, and thus unsuitable for use in any form of shortURLs, or any other material which might be printed.