NewBase64



New Base 64

 

A base 64 numbering system using only print-safe and URL-safe ASCII numbers and letters.

 or

a side effect of creating a variable precision short geo URL encoding scheme.

 

Tantek Çelik, 2010-075

shortlink: http://ttk.me/w/NewBase64

 

introduction

 

I wanted to create a variable precision encoding for geo coordinates (latitude and longitude) in order to represent them via short URLs, and determined that octal coordinates would compress pair by pair into base 64 character by character (OctalGeo). Given that I had already developed a print/URL/ID safe base 60 numbering system from ASCII characters (NewBase60), it seemed logical to start with that to see if it was possible to extend it to base 64, while minimally relaxing the original design constraints and maintaining print/URL safety at a minimum.

 

background

 

See first the "URL shortening" and "Methodology" sections of NewBase60, as those both apply here.

 

Since NewBase60 used up all the characters that satisfied the "Identifier compatibility" design methodology criteria, specifically "HTML ID and NAME tokens", this constraint must be relaxed in order to find four more characters for encoding the decimal numbers 60, 61, 62, 63. 

 

The drivers of the "Identifier compatibility" constraint were URLs, HTML ID attributes, and identifiers in computer languages. The latter two lack the necessary four remaining characters, and thus we drop them.

 

methodology

 

 

derivation

 

 

New Base 64 numbering system

 

The complete numerical sexagesimal sequence grouped:

 

The complete sequence written out in 8 rows of 8 for readability:

 

   0 1 2 3 4 5 6 7
 0 0 1 2 3 4 5 6 7
 8 8 9 A B C D E F 
16 G H J K L M N P 
24 Q R S T U V W X 
32 Y Z _ a b c d e
40 f g h i j k m n 
48 o p q r s t u v 
56 w x y z - + * $

 

open source implementation

The following CASSIS V0 code (runs in PHP and javascript with the inclusion of cassis0.js from the CassisProject) is licensed under a Creative Commons Attribution Share-Alike license (for now). Please attribute to "Tantek Çelik" and link attribution to http://tantek.com

 

Cassis V0 code

runs in both PHP and javascript

 

function numtob64($n) {
 $s = "";
 $m = "0123456789ABCDEFGHJKLMNPQRSTUVWXYZ_abcdefghijkmnopqrstuvwxyz-+*$";
 if ($n===undefined || $n===0) { return 0; }
 while ($n>0) {
   $d = $n % 64;
   $s = strcat($m[$d],$s);
   $n = ($n-$d)/64;
 }
 return $s;
}

function numtob64f($n, $f) {
 $s = numtosxg($n);
 if ($f===undefined) { 
   $f=1; 
 }
 $f -= strlen($s);
 while ($f > 0) { 
   $s = strcat("0",$s); 
   --$f; 
 }
 return $s;
}

function b64tonum($s) {
 $n = 0;
 $j = strlen($s);
 for ($i=0;$i<$j;$i++) { // iterate from first to last char of $s
   $c = ord($s[$i]); //  put current ASCII of char into $c  
   if ($c>=48 && $c<=57) { $c=$c-48; }
   else if ($c>=65 && $c<=72) { $c-=55; }
   else if ($c==73 || $c==108 || $c==33) { $c=1; } // typo capital I, lowercase l, ! to 1
   else if ($c>=74 && $c<=78) { $c-=56; }
   else if ($c==79) { $c=0; } // error correct typo capital O to 0
   else if ($c>=80 && $c<=90) { $c-=57; }
   else if ($c==95) { $c=34; } // underscore
   else if ($c>=97 && $c<=107) { $c-=62; }
   else if ($c>=109 && $c<=122) { $c-=63; }
   else if ($c==45) { $c=60; } // dash
   else if ($c==44) { $c=61; } // plus
   else if ($c==42) { $c=62; } // asterisk
   else if ($c==36) { $c=63; } // dollar sign
   else { $c = 0; } // treat all other noise as 0    
   $n = 64*$n + $c;  
  }  
  return $n; 
 }

 

history

 

2010-014

 

Met Justin Kruger and learned about his loc.is service from Kevin Marks. As far as I know this is the first instance and idea of a geo short URL which has character by character increasing precision (albeit only after the first 2 characters). Seemed like a really good idea (geo short URL, increasing precision character by character), but certainly appeared to have some (addressable) weaknesses:

 

  1. First two characters treated "specially" (Nautical time zone, longitudinal region) rather than as indicators of an index into a grid.
  2. Starting unit of area precision is a rectangle (rather than a square).
  3. Character encoding uses non-print-safe characters, e.g. loc.is/1S4Ql (trailing lowercase L)

 

While the character-by-character increasing 2D precision is certainly a clever idea, and the loc.is implementation is quite nice, I was convinced that the above mentioned weaknesses could/should be addressed.  A NewBase64 (as described on this page) solves weakness #3.  My thoughts for a fully OctalGeo which uses an initial 32 square (4x8) grid with further precision specified in 1/64 squares to replace latitude and longitude solves weaknesses #1 and #2.

 

 

2010-071

 

On a flight SWA396 from DEN gate C30 to AUS, I finally brainstormed a specific description of the idea of a new Base64 for use in URLs but not ID attributes that would form the basis of an OctalGeo solution. Raw notes:

 

Base64 for URLs only (not id attributes)

For use in mathematical/2d encoding scenarios (eg geo coordinates 6 bits per digit of 2d precision = 3bits lat and long)

Need 4 more chars in addition to base60

Additional chars to consider: +-*.,^=()[]{}@:;~

Prefer less ambiguous chars.

Reserve chars that could serve other funcs:

+ means space in URLs, avoid using whitespace to encode

:; avoid using : and ; to permit latter reuse for property:value; syntax

@ for @ rules (like in CSS) and twitter handles

{} for code/declaration blocks

, for coordinates, lists, sets

?=& reserved for URL query params

| collapse to 1

! collapse to 1, also trailing sentence punctuation

. fails because as trailing is sentence punctuation - could reuse inside and as leading char for fractional values. Can also use as a terminator to indicate baseline (in case of --- or ____ numbers)

- is very reasonable, already part of many property names

^ seems workable, means "1" in sumerian, could use for 61

~ tilda is reasonable

$ is reasonable, already part of var names in js

That progression is nice as it is linear/monotonic with # of strokes: - ^ ~ $ for 60-63. small dash - after z also nicely parallels big underscore _ after Z.

Also, $ as the "last" char is nicely easy to remember as it means end/suffix in regexes. 

 

This first attempt was close, but unfortunately the characters ^ and ~ are considered "unsafe" by RFC1738, "because gateways and other transport agents are known to sometimes modify such characters", and thus this is suboptimal for a URL-safe base 64. I discovered this when writing up my new base 64 thoughts (while online and able to check RFC1738), and chose other characters accordingly as described in the methodology section.

 

However, this initial on-the-flight brainstorm was quite close to what I eventually came up with, only needing to replace ^ and ~ with + and *, and maintaining the number of strokes progression.

 

prior work

There have been a few Base64 encodings proposed and put into use, however, none of them satisfied the above design constraints.

 

RFC 3548 Base 64 Encoding

Problems:

  1. values 0-15 do not match up with hexadecimal (nor obviously NewBase60)
  2. several out-of-ASCII-order sequences are used.
  3. not print-safe. uses I (capital i), O (capital o),  l (lowercase L), and numbers 1 and 0 for different values.
  4. not URL-safe. use of "/" character makes it URL-unsafe or at a minimum implying subdirectory / path-component semantics when none are present.

 

with URL Safe Alphabet

Problems:

While this variant of the RFC 3548 Base 64 Encoding does solve the URL-safe problem (#4 above), by replacing the use of the "/" with "_", it still has the first three problems:

  1. values 0-15 do not match up with hexadecimal (nor obviously NewBase60)
  2. several out-of-ASCII-order sequences are used.
  3. not print-safe. uses I (capital i), O (capital o),  l (lowercase L), and numbers 1 and 0 for different values. 

 

unsuitability of prior work

In short, both RFC 3548 Base 64 encoding variants have the primary problem of not being unambiguously print-safe, and thus unsuitable for use in any form of shortURLs, or any other material which might be printed.

 

references