 

Whistle

Last edited by Tantek 1 year ago

Whistle personal URL shortener

 

Whistle is an open source, algorithmically reversible, personal URL shortener.

There is an instance of Whistle running at http://ttk.me.

 

Note: if you're looking for an open source, database-based, generic (any URL) shortener, see other projects such as yourls.org. Feel free to suggest others. (This note previously pointed to ur1.ca (that's U R ONE dot C A), which subsequently died.)

 

building blocks

Whistle makes use of the following building blocks:

 

clients

Whistle is used by:

 

design requirements

summary of design requirements for Whistle

 

  • personal URL shortener
  • personal short URL domain
  • algorithmic shortening (no database/opaque id indirection)
  • human readability / print safety (ShortURLPrintExample)
  • as short as possible (NewBase60 compression)
  • per content-type URL space partitioning (one character content-types)

 

simple example

Here is a sample actual permalink URL and how it's converted to a permashortlink:

 tantek.com/2010/034/t2/diso-2-personal-domains-shortener-hatom-push-relmeauth

The slug is purely for display and search, the unique portion of the URL is actually:

 tantek.com/2010/034/t2

this compresses to the following Whistle short URL:

 ttk.me/t4432

via /t2 = 't' text note number '2' for the day, and

/2010/034 == 2010-034 ordinal ISO8601 date == '443' in sexagesimal NewBase60 epoch days, and

ttk.me is the short domain for tantek.com.

This algorithm is reversible and thus inverting it resolves the short URL into the original permalink.

 

why

Why short URLs, why is it necessary to run your own URL shortener, why the Whistle design, and how does it work?

 

why short URLs

Historically the need for "short" URLs is not new. But what we mean by "short" has certainly changed a lot recently.

 

Old email systems would wrap text at 80 characters (or a few less), making it just hard enough to reliably reconstruct or use URLs that many systems adopted a common practice of keeping URLs to 70 characters or less by design.

 

This was fundamentally usability driven.

 

Shorter URLs in email are easier to use and more reliable. They're nicer in IM too.  And browser screenshots. I've retyped URLs from screenshots in slides, I'm sure many of you have too.

 

How about print? Ever typed in a URL from a book? Or advertising. Magazine spreads or billboards - URLs are ubiquitous.  See ShortURLPrintExample for actual documented real-world examples of short URLs in print.

 

The easier to read and type-in, the more folks visit the URL.

 

But again, this is nothing new. Ever since the dotcom boom URLs have become a part of our visual language (much to the chagrin of linguists I'm sure).

 

why do you need your own

Why do you need your own URL shortener and short domain?

 

Twitter rewrote our brains to think in 140 characters and suddenly every one of those characters counted.

 

And two things happened:

  1. URL shortener services showed up which would trim any URL down to a small handful of characters, saving your precious tweetspace for your own words. Everyone started using them. Twitter and clients started auto-shortening URLs.
  2. We started to understand just how fragile these shorteners are, and how they break the web. How many shortener sites have died, taking their links with them to the bit bucket? Even tr.im, which is keeping the lights on longer than others, is set to shut down at the end of 2010. It was frustration with tr.im's downtimes and then its end-of-service announcement that led me to this realization: it's not good enough to have your own URL; you need to have your own shortener as well.

 

This isn't just for independents. Companies and hosting services should have their own too. The first big site to realize and do this right was Flickr, and many have followed.

 

The key here is that when you own and host your own shortener for links to your content, you're not adding any more fragility to the web. If your shortener goes down, your site is probably down as well. They're tied together. No additional risk. Unless you use a database for the shortenings and you lose your database because you were unwilling (or unable) to pay the DBA tax to maintain it. We'll talk more about the DBA tax problem in due time.

 

But why is it important to own the shortened links to your content? Why not just always share your full "long" URLs?

 

In short:

  1. You can't always do so. E.g. Twitter now auto-shortens many URLs.
  2. Shorter URLs tend to be better for sharing (for all the reasons discussed at the top). 

 

And that #2 is where we get to DiSo.  A couple of the key architectural components of DiSo 2.0 are:

  1. Publish on your own site, own your URLs, your permalinks, and
  2. Syndicate out to other sites. Your text updates to Twitter, your checkins to Foursquare, your photos to Flickr etc.

 

The direction of the content flow is very important here, as it has to do with ownership, and what's the original vs. what's just a copy.

 

It's ok to sharecrop copies, especially when the copies link back to your original. That's called distribution.

 

It's not ok to sharecrop the original and aggregate copies on your own site. You're still sharecropping and you're still beholden/vulnerable to those 3rd party sites going down, censoring your content, renaming you, or being blocked by some nationwide internet filtering firewall.

 

In a DiSo solution, when you syndicate your content out to other sites, the key is that those syndicated copies of your content link back to the original. Permalinks serve this role for blog posts. For short text updates that you syndicate to Twitter or Identi.ca etc., you need perma-short-links. And that's where your own shortener is essential.

 

why an algorithmic URL shortener

One of the key emphases of the DiSo 2.0 I've outlined is maintainability. Fewer moving parts, fewer magic hidden files, fewer things that can inexplicably fail = more independents successfully running and owning their own sites, identities, web presences over time.

 

Nearly all (maybe all?) open source URL shorteners today use a database to store the pairs of "short code" and "actual URL". If you lose that database, forget to back it up, have some bad database code that corrupts it etc., your shortlinks are gone, dead, useless.

 

If instead you create and use a URL shortener to create shortlinks that are algorithmically reversible, and then document that algorithm, publicly, then anyone can figure out how to expand your shortlinks. If they happen upon them on some random site, they can expand them and look for the original, or at least know that you're linking to the same thing that a normal permalink somewhere else is expressing.

 

In addition, all manner of browser or aggregator tools and sites that currently have to manually resolve shortlinks by calling the APIs of their services can save the bandwidth and time and simply decode your URLs themselves.

 

Once again, Flickr set a very good example with their http://flic.kr/ shortener for Flickr photos.

 

In fact, their doing so inspired me within days to grab http://ttk.me/ and set it up to redirect to my site http://tantek.com/, knowing I would eventually (as I have) add various shortening services to it.

 

Similarly I encourage every independent out there, everyone who wants to install and/or run their own DiSo implementation (like Falcon), to go ahead and not just grab a domain name for themselves, but also grab a shortener domain too. Set it up to redirect to your primary domain for now.

 

why human readability

Short URLs are used in contexts where humans end up reading them and typing them in by hand. Some examples:

  • ShortURLPrintExample - when present in print, humans are expected to type in the URL by hand. Thus the need for short URLs to be "print safe" (a subset of human readability)
  • "faster" to view/retype than copy/paste/share. There are some proprietary platform applications, such as Instagram on iOS, for which it is faster to view the URL of a post in the app, and then retype it into a laptop, than attempt to send/share from an iOS device (e.g. without any email setup) to a laptop. Instagram actually itself ironically produces print-unsafe URLs such as:
    • instagr.am/p/GTIm_ (note the capital "i" after the "GT", which could ambiguously be read as a lowercase "L", especially adjacent to the lowercase "m", or as the number "1")
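
To make the print-safety concern concrete, here is a minimal Python sketch that flags characters in a short code that are easy to misread when printed. The CONFUSABLE table and the function name are hypothetical, chosen only for illustration; they are not part of Whistle.

```python
# Hypothetical table of characters that are easy to misread in print,
# mapped to what they could be mistaken for. Illustrative, not exhaustive.
CONFUSABLE = {"I": "l or 1", "l": "I or 1", "O": "0"}


def print_unsafe(code: str) -> list[tuple[int, str, str]]:
    """Return (position, character, possible misreading) for each risky character."""
    return [(i, ch, CONFUSABLE[ch]) for i, ch in enumerate(code) if ch in CONFUSABLE]


print(print_unsafe("GTIm_"))  # [(2, 'I', 'l or 1')]
print(print_unsafe("t4432"))  # []
```

A shortener that never emits such characters in the first place avoids the problem entirely, which is the approach print-safe encodings take.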

 

why one letter content type codes

Two more things that Kellan got right in the Flickr shortener which I've also found inspiration in:

 

  1. just a "/p/" to indicate "Photo" presumably (clever idea to prefix like that to allow for other prefixes to do other things)
  2. and then a Base58 compressed photo id. 

 

Regarding 1, I've also settled on one-character "spaces" for different types of URLs. "p" for photo makes sense to re-use. After quite a bit of personal research into what types of content are different enough and used often enough to warrant their own short URL spaces, I've come up with about 20 different content types, each with their own letter.

 

Here are a few examples from my content-type short codes:

 

  • b - blog post, article (structured, with headings), essay
  • i - identifier - on another system using subdirectories as system id spaces
    • i/i/ - compressed ISBN numbers
    • i/a/ - compressed ASIN numbers
  • p - photo
  • t - text, (plain) text, tweet, thought, note, unstructured, untitled 

 

 

These and the rest are documented more fully in the "design" section below.

 

 

how do the t short URLs work

Specifically for text notes, I decided to keep my "t" shortener as short as possible, which meant dropping a trailing "/".

 

After that I use a 3 digit sexagesimal (Base60) number to represent the date in a manner deliberately limited to human individuals. Why Base60? Lots of reasons, including print-safety (as mentioned above). Want to read the entire derivation and reasons why? See NewBase60 (includes open source CASSIS implementation).

 

Why 3 sexagesimal digits to represent the date? It turns out that 3 sexagesimal digits are capable of representing over 500 years of days (60^3 = 216,000 days, roughly 591 years) - plenty overengineered for any human lifetime. And if anyone does figure out how to live more than 500 years I have a feeling that person will not only not much resemble a human as we know it, but will either have bigger problems to deal with than URL shortener limitations, or will be so smart that they will come up with a better solution.

 

But for now, for our feeble less than 200 year lifetimes, this is good enough. In addition we can even agree on a day zero that computes well with existing platforms. Unix Epoch start: 1970-01-01. Given that no-one published anything to the web before 1990, I think we're ok with that. What happens in a few hundred years? Perhaps people can pick their own day zeroes as they see fit.

 

Thus the 3 characters after the "t" represent the number of days since 1970-01-01 in sexagesimal - what I'm calling "epoch days".
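
As a rough illustration of the epoch-days encoding, here is a minimal Python sketch. The alphabet string is assumed to match the NewBase60 digit set (see NewBase60 for the authoritative definition), and the function names and the zero-padding to 3 digits are assumptions made for this example rather than Whistle's actual code.

```python
from datetime import date

# Assumed NewBase60 digit set: 0-9, A-Z without I and O, underscore,
# a-z without l. Check NewBase60 itself for the authoritative list.
ALPHABET = "0123456789ABCDEFGHJKLMNPQRSTUVWXYZ_abcdefghijkmnopqrstuvwxyz"
EPOCH = date(1970, 1, 1)  # day zero: Unix epoch start


def num_to_sxg(n: int) -> str:
    """Convert a non-negative integer to a sexagesimal string."""
    if n < 0:
        raise ValueError("n must be non-negative")
    digits = ""
    while True:
        n, r = divmod(n, 60)
        digits = ALPHABET[r] + digits
        if n == 0:
            return digits


def epoch_days_sxg(d: date) -> str:
    """Return a date as 3 sexagesimal digits of days since 1970-01-01."""
    days = (d - EPOCH).days
    if not 0 <= days < 60 ** 3:
        raise ValueError("date does not fit in 3 sexagesimal digits")
    return num_to_sxg(days).rjust(3, "0")  # zero-padding is an assumption


print(epoch_days_sxg(date(2010, 2, 3)))  # 2010-034 -> 443
print(60 ** 3)  # 216000 days of capacity, roughly 591 years
```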

 

Finally I allow for 1 (or 2, but haven't needed it yet) more sexagesimal digit to indicate the nth ordinal post of that type for that day. Thus:

 

ttk.me/tSSSn

 

  • SSS = sexagesimal epoch days
  • n = nth post that day

 

This is sufficient to expand to:

 

tantek.com/YYYY/DDD/tn/

 

 

Which I then redirect server-side to a longer URL with post keywords (AKA "slug") on the end. E.g.

 

ttk.me/t4432 is

 

  • t - text note
  • 443 - sexagesimal epoch day 443 (14643 in decimal) = 2010-034, the 34th day of 2010
  • 2 - 2nd text note that day

 

thus expands to:

 

tantek.com/2010/034/t2/

 

which is enough for Falcon to retrieve the post in the hAtom store, where it also gets the keyword/slug phrase for the post, and uses it to redirect it to: 

 

tantek.com/2010/034/t2/diso-2-personal-domains-shortener-hatom-push-relmeauth 
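
The expansion step above can be sketched in a few lines of Python. This is a sketch under stated assumptions, not Falcon's code: the alphabet is assumed to match NewBase60, the function names are hypothetical, and the final slug lookup in the hAtom store is left out because it needs the stored post.

```python
from datetime import date, timedelta

# Assumed NewBase60 digit set (see NewBase60 for the authoritative list).
ALPHABET = "0123456789ABCDEFGHJKLMNPQRSTUVWXYZ_abcdefghijkmnopqrstuvwxyz"
EPOCH = date(1970, 1, 1)


def sxg_to_num(s: str) -> int:
    """Convert a sexagesimal string to an integer."""
    n = 0
    for ch in s:
        # str.index raises ValueError for characters outside the alphabet.
        n = n * 60 + ALPHABET.index(ch)
    return n


def expand_text_note(short_path: str, domain: str = "tantek.com") -> str:
    """Expand 'tSSSn' into 'domain/YYYY/DDD/tn/' (slug not included)."""
    if len(short_path) < 5 or not short_path.startswith("t"):
        raise ValueError("expected a path of the form tSSSn")
    d = EPOCH + timedelta(days=sxg_to_num(short_path[1:4]))
    n = sxg_to_num(short_path[4:])
    # tm_yday is the ordinal day of the year (1-366).
    return f"{domain}/{d.year}/{d.timetuple().tm_yday:03d}/t{n}/"


print(expand_text_note("t4432"))  # tantek.com/2010/034/t2/
```

Because nothing here depends on a lookup table, any third party who knows the documented scheme could run the same expansion.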

 

design

Design notes:

  • single-letter content-type prefix, description, and ActivityStreams equivalent object-type if any
    • a - audio/video recording, speech, talk, session, sound, animation. temporal media content typically with a play/pause/scrubber interface.
    • b - blog post, article (structured, with headings), essay
    • c - code, sample code, library, open source, code example
    • d - diff, edit, change, including tag-of and especially person-tag reply posts
    • e - event - hCalendar
    • f - favorited - primarily just a URL, often to someone else's content. for more, see 'r' below 
    • g - geo checkin - act of checking into a location, venue. e.g. dodgeball, foursquare
    • h - hyperlink - e(x)ternal reference, link, etc. use of short URL to link to things that I expect to die or move, untrustworthy permalinks. 
    • i - identifier - on another system using subdirectory as system id space
    • j - reserved
    • k - reserved
    • l - (skipped due to resemblance to 1, per print-safety design principle; related: ShortURLPrintExample)
    • m - (message like email, permalink to external list archive, or private blog archive, or a sender-hosted message)
    • n - reserved
    • o - physical objects (e.g. stuff from Amazon, or URLs attached to actual specific physical objects) 
    • p - photo (re-using Flickr's design choice of flic.kr/p/ for photo short URLs)
    • q - reserved
    • r - review, recommendation, rating - h-review / hReview
    • s - slides, session presentation, S5 
    • t - text, (plain) text, tweet, thought, note, unstructured, untitled 
    • u - (update, could be used for status updates of various types, profile updates)
    • v - venue, physical location (possibly moving or non-stationary), for referencing as an event location, or in a geo checkin.
    • w - work, work in progress, wiki, project, draft, task list, to-do, do, gtd
    • x - XMDP Profile 
    • y - reserved
    • z - reserved 

 

  • b - blog post specific short URL design: /b/SSSn (prefer /x/ design per Flic.kr precedent, extensibility)
  • t - text note specific short URL design: /tSSSn (as short as possible, to leave max chars for the note)
    • SSS - NewBase60 epoch days
    • n - nth post for the day
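
The two path designs above (/tSSSn with no separator, /b/SSSn with one) can be handled by one small splitter. The following Python sketch is only an illustration: the function name is hypothetical, the path "/b/4432" is a made-up example, and nested spaces such as i/i/ are deliberately not handled.

```python
def split_short_path(path: str) -> tuple[str, str, str]:
    """Split '/tSSSn' or '/b/SSSn' into (content type, SSS, n)."""
    path = path.strip("/")
    if not path:
        raise ValueError("empty short path")
    kind, rest = path[0], path[1:]
    # The /b/ design has a slash after the type letter; the /t design does not.
    if rest.startswith("/"):
        rest = rest[1:]
    if len(rest) < 4:
        raise ValueError("expected 3 date digits plus at least 1 ordinal digit")
    return kind, rest[:3], rest[3:]


print(split_short_path("/t4432"))   # ('t', '443', '2')
print(split_short_path("/b/4432"))  # ('b', '443', '2')
```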

 

Replies, responses, rebuttals used to be part of 'r', but in practice are more of an aspect of a post rather than a type of post in and of themselves. A 't'ext note can be a reply, and a 'b'log post can also be a reply. The "replyness" of a post is determined by the presence of a link with in-reply-to markup.

 

I've been posting RSVPs in practice as 't' notes with both an in-reply-to link to an event post, and an explicit p-rsvp property. The presence of the p-rsvp property is sufficient to determine that it is an RSVP post, or subtype of text note.

 

under consideration

  • q - question post, e.g. the kind of post people post to Quora or Stack Overflow. See sample question on Quora.
  • m - metric, measurement of self, i.e. quantified self type data, weighing, movement, walking, running, biking, driving etc.
    • and thus collapse "message" as just a use case of either text note (simple text/SMS/email messages), or article/blog post (messages with an explicit subject).
  • ?x - scrobbles/listens/watchings/viewings - I think those are all the same kind of post, something passive that is about your environment more than it is about you: an "experience" post. Not sure what shortcode I'd use for them. I could pick from j, k, y, z (reserved), or I could use 'x' since that is in "experience" and "external", and simply drop XMDP profile, as I haven't posted any of those on my own site and don't expect to.

 

design related analysis

Other single-letter post type schemas / short URL designs:

  • My Post Type Schema by Barnaby Walters - "After seeing Tantek Çelik’s Whistle URL Shortener Design Notes I decided to base my schema off his."
  • SHORT URL NAMESPACES by Shane Becker - "... internal notes about my Homesteading post types plan ..."
  • ...

 

implementation

Whistle has an implementation of the following:

  • single-letter content-types (on ttk.me for tantek.com)
    • b - blog post, article (structured, with headings), essay
    • i - identifier - on another system using subdirectory as system id space
    • t - text, (plain) text, tweet, thought, note, unstructured, untitled
    • w - work, work in progress, wiki, project, draft, task list, to-do, do, gtd
      • for now the 'w/' short URLs simply redirect to this wiki, eventually they'll redirect to wiki pages on tantek.com, likely hosted/published by Falcon
  • single-letter content-types (on ufs.cc for microformats.org)

 

interviews

Interview by Steve Ivy published on monkinetic:

 

FAQ

Why not use days since you were born

Q: Why not use days since you were born instead of days since epoch start (1970-01-01) ?

A: In short: 1. easier debugging, 2. birthday privacy. First, from a practical perspective, reusing epoch start makes it easier to debug: 0 datestamp means 0 epoch time, everyone's personal permalinks share the same NewBase60 datestamps etc. And second, using your birthday as your 0-day for permalink datestamps would have the side-effect of publishing the precise year/month/day of your birth, which not everyone may want to do - in fact, typically people still keep their full birthday private rather than publishing it openly on the web. Long term, if this encoding scheme is still used in say 200+ years, it may make sense to pick a new day zero for folks born after a certain point in time (e.g. perhaps 2200-001 for everyone born on that day or later).

 

Why not just use the short URLs all the time?

Q: Why redirect? Why not just use the short URLs all the time?

A: In short: 

1. Aesthetic friendliness. Compressed characters (even NewBase60) look like line noise errors to most people.

2. Branding. "tantek.com" (or whatever your own personally recognizable domain) has value purely by name.

3. SEO. Using a long URL with a title/keyword slug as the final segment helps with better search result placement.

 

Why use short ids instead of URLs in syndicated notes?

Q: Why use permashortids like (ttk.me t4MY1) in syndicated tweets instead of URLs like http://ttk.me/t4MY1?

A: In Twitter (and other short messaging systems), there's a strong user expectation that any link will provide more content than the tweet (short text note) itself. When a link merely shows the user what they've already seen (even if it is a nicer UI, the original posting URL etc.), they tend to get upset (went through an actual iteration of this, with plenty of feedback from colleagues - some even publicly posted ;). So now I only provide a full permashortlink URL at the end of my syndicated text notes if there is more content at the original, e.g. photos, videos, expansion of elided text, etc.

 

Why use space as a delimiter inside permashortids?

Q: Why use space instead of punctuation as a delimiter inside permashortids?

A: I actually started with using a slash "/" as a delimiter because it made the permashortids URL-like (recognizable) without making them clickable in Twitter - for a while, until they changed their auto-linker to link them up (which defeated the purpose of making them short ids; see previous FAQ).

 

Once back to the drawing board, I reanalyzed the problem and realized that the permashortid at the end of my tweets is essentially a shorthand citation to the original work. Thus I researched existing shorthand citation conventions / formats and re-used accordingly.

Both Harvard and The Chicago Manual of Style (click the "AUTHOR-DATE" tab to see examples) use the format:

(author date)

[for more on citation formats/styles, I've collected some research here: http://microformats.org/wiki/citation-formats#styles, naturally :) ]

Thus for my posts the analogy to

 

(author date)

 

is

 

(shortdomain datedID)

 

e.g.

 

(ttk.me t4MY1)

 

where


Also from a readability design perspective, even if you didn't have the prior art, alternatives like more punctuation (a ":" or an "=") are noisier (among the text) and uglier too (for humans). Punctuation should only be used when doing so is an improvement over not. Since the parentheses are already containing the citation information, a space is a reasonable delimiter - no colon or equals or other symbols necessary.
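
For illustration, converting between a short URL and the (shortdomain datedID) citation form is a simple string transformation. The Python sketch below uses hypothetical function names and assumes the http scheme when rebuilding the URL.

```python
from urllib.parse import urlsplit


def to_permashortid(short_url: str) -> str:
    """Turn 'http://ttk.me/t4MY1' (or 'ttk.me/t4MY1') into '(ttk.me t4MY1)'."""
    if "://" not in short_url:
        # urlsplit only recognizes the host when the URL starts with '//'.
        short_url = "//" + short_url
    parts = urlsplit(short_url)
    return f"({parts.netloc} {parts.path.lstrip('/')})"


def to_short_url(permashortid: str, scheme: str = "http") -> str:
    """Turn '(ttk.me t4MY1)' back into a clickable short URL."""
    domain, short_id = permashortid.strip("()").split(" ", 1)
    return f"{scheme}://{domain}/{short_id}"


print(to_permashortid("http://ttk.me/t4MY1"))  # (ttk.me t4MY1)
print(to_short_url("(ttk.me t4MY1)"))          # http://ttk.me/t4MY1
```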

 

 

Related Projects

 

