The Anatomy of a URL: Protocol, Hostname, Path, and Parameters all in one string of content
The ins and outs of URL structure, primarily geared towards web analysts, but probably useful to a larger population.
Put this post in the “very tactical” bucket, covering some things I’ve found myself explaining off and on over the years. Just last week, I wound up digging into an “oops” on some client work that was partially triggered by someone’s limited understanding of how URLs work, and, when I did a quick Google search to see if I could find a clean explanation of what I was trying to explain…I failed. Thus, a blog post was born.
Why Analysts Should Understand URLs
URLs are fundamental to the internet. And, while web sites are having their digital dominance chipped away by social media and mobile apps, URLs remain a core component of the Language of Digital.
For analysts, there are two key reasons that a solid grasp of URLs matters:
- Web analytics tools — Google Analytics, Adobe/Omniture Sitecatalyst, Coremetrics, Webtrends, and the like all pack a wheelbarrow’s worth of data into a customized URL every time a user takes a tracked action; you can see a 4-minute video on that subject or read a much more detailed explanation as to the mechanics of that process
- Pages on the site (and hackery therein) — some web analytics platforms use the URL (or some part of the URL) as the core means for reporting “Pages” data (Google Analytics, for one); some don’t (Sitecatalyst). Either way, it’s important to understand the different components of a URL, how they affect the data feeding into your analytics tool, and how you can occasionally tweak a URL to get some supplemental data without doing a single lick of development on your site.
Lengthy preamble complete… Let’s dive in!
The Anatomy of a URL
Although each URL is a single string of numbers, letters, and special characters, each URL has four distinct components:
- Protocol — always present
- Hostname — always present
- Path or Stem — always present…but sometimes is, basically, null
- Parameters — optional (but this is where some of the real fun can happen)
Below is a fictitious URL with each of these components identified:
When a URL is “executed,” a couple of things happen:
- The magic of the internet occurs (browser mechanics, DNS resolution, etc.) to actually get that request routed all the way through the interwebtubes to a web server somewhere; the mechanics of this are beyond the scope of this post.
- That web server interprets the request (the URL plus some other information that invisibly — and equally magically — comes along with it) and figures out what information needs to get sent back to the requestor
In that second step, the URL gets broken up into its four distinct components. And, if you actually start digging into web server logs, you will find that each of these components is stored in a separate “field” in each log entry. If you actually enjoy digging into web server logs, then, well, you’re not alone. You officially have one of the markers used to identify career digital analysts!
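If you want to see that breakup without opening a log file, here is a minimal sketch using Python’s standard `urllib.parse` module. The URL is invented for the example, and the field names are Python’s, not the names any particular web server uses in its logs.

```python
from urllib.parse import urlsplit

# Hypothetical URL, invented purely for illustration.
url = "http://www.example.com/products/widgets/index.htm?color=blue&size=large"

parts = urlsplit(url)

print(parts.scheme)    # protocol   -> http
print(parts.hostname)  # hostname   -> www.example.com
print(parts.path)      # path       -> /products/widgets/index.htm
print(parts.query)     # parameters -> color=blue&size=large
```

Everything up to the “://” is the protocol, everything from there to the next “/” is the hostname, and the “?” splits the path from the parameters.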
Component 1: The Protocol
The protocol is pretty fundamental, but it’s also the least interesting to a digital analyst. It’s simply an indication of what overarching framework is being used to transmit data back and forth:
- Far and away the most common is http, which stands for (did you know this?) “Hypertext Transfer Protocol.”
- When people started buying stuff and accessing sensitive information over the internet years ago, a more “private” version of http came into being, which was https. What’s the “s” for? Well, “secure,” of course! When sites are accessed using https, it’s tougher to get at some data, but that protocol exists for a reason, so don’t start trying to hack your way around that. https was actually at the core of Google’s decision to start encrypting keyword search data for users who were logged into Google when they did searches, if you’ve been following or are affected by that kerfuffle.
- FTP is another fairly common protocol, which is used more for “file”-related data; FTP stands for “File Transfer Protocol.”
That’s really all there is to the protocol. It’s good to know what it is, but it’s not super interesting.
Component 2: The Hostname
The hostname is a bit more interesting than the protocol, and is, basically, the “domain” to which the URL is referring. The main hostname for this site is “gilliganondata.com.” But, “www.gilliganondata.com” also works, and the fact that both exist for the same content is where things start to get a little interesting.
The hostname can actually be broken down into several parts:
- .com (or .edu or .net or whatever) — this is actually the “TLD” or “top level domain.”
- “gilliganondata.com” is often referred to as the “domain” for the site, but that is not, strictly speaking, correct. Technically, “gilliganondata.com” is a subdomain of “.com.” But, almost no one talks about sites that way, so let’s just say that the domain is “gilliganondata.com”
- “www.gilliganondata.com” is actually a subdomain of “gilliganondata.com.” I could have multiple subdomains all hosted on different servers — search.gilliganondata.com, recipes.gilliganondata.com, etc. The “www” is something of a throwback convention and, usually, set up to work exactly the same as the base domain. BUT, every few months, I come across a site where <sitename>.com doesn’t load, but www.<sitename>.com does. This is purely a configuration miss on the part of the site owner that is easily fixed.
That last bullet was getting really long, wasn’t it? Subdomains do matter:
- If you’re not careful, “www.<yoursite>.com/” will get treated as a different page than “<yoursite>.com/” by search engines and/or your web analytics tool. That’s not good.
- If you have content hosted on a totally different system than your main site (a jobs board, a store locator, a discussion forum, etc.), a best practice is to create a new subdomain for that site but keep it under the same domain. This is usually very, very easy — do a Google search for “CNAME record” and you’ll be totally set on that front.
- There are cookie (visit and visitor identification) implications when it comes to the domains and subdomains in use on a site, but this post is going to be long enough without me diving into those. Trust me. Fewer domains is better.
So, even though the hostname is a pretty small part of the overall URL, it’s important, and there is interesting stuff that goes on with that component.
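To make the hostname pieces concrete, here is a rough sketch in Python. Both helper functions are hypothetical names of my own, and the “last label is the TLD” logic is a simplifying assumption: it works for a hostname like this site’s, but not for multi-part suffixes such as “.co.uk.”

```python
from urllib.parse import urlsplit

def hostname_labels(url):
    """Return the hostname's dot-separated labels, left to right."""
    host = urlsplit(url).hostname or ""
    return host.split(".")

def strip_www(url):
    """Return the hostname with one leading 'www.' removed (an assumed normalization rule)."""
    host = urlsplit(url).hostname or ""
    return host[4:] if host.startswith("www.") else host

labels = hostname_labels("http://www.gilliganondata.com/")
print(labels[-1])             # com                 (the TLD)
print(".".join(labels[-2:]))  # gilliganondata.com  (the "domain," loosely speaking)
print(labels[:-2])            # ['www']             (the subdomain labels)

# With the "www." stripped, both versions of the hostname compare as equal.
print(strip_www("http://www.gilliganondata.com/") == strip_www("http://gilliganondata.com/"))  # True
```

Something like `strip_www` is one way to keep the “www” and non-“www” versions of a page from being counted as two different pages in your own analysis. Whether your analytics tool or a search engine treats them as the same is a separate configuration question.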
Component 3: The Path
The path (or stem) in the URL is analogous to the file path for a file on your computer. It often has an inherent drilldown/tree structure that uses “/”s in some organizing fashion. The path includes the filename, if there is one: index.htm, products.php, about.html, etc.
The path is somewhat static. That doesn’t mean you can’t have a content management system (CMS) that generates new paths like crazy, but, typically, each unique path represents either a core “page” of content or a core content template (that then uses parameters — which we’ll get to next — to update the actual content).
For news sites and blogs (including this one), you will often see “date” data built into the path structure (that’s what the “/2012/05/22/” in the URL of this post is — it’s showing that the post was originally published on May 22, 2012). For any site that cares at least a half of a whit about search engine optimization, you will see keywords relevant to the content as part of the URL (thus “the-anatomy-of-a-url-protocol-hostname-path-and-parameters” being in the path of this post).
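Because the path is just text separated by “/”s, pulling that date and those keywords back out takes only a few lines. This is a sketch in Python that assumes a /YYYY/MM/DD/slug/ layout, which I pieced together from the two path fragments quoted above; other sites structure their paths differently.

```python
from datetime import date

# Assumed path layout: /YYYY/MM/DD/slug/
path = "/2012/05/22/the-anatomy-of-a-url-protocol-hostname-path-and-parameters/"

# Drop the empty strings produced by the leading and trailing slashes.
segments = [s for s in path.split("/") if s]

year, month, day = (int(s) for s in segments[:3])
published = date(year, month, day)
keywords = segments[3].split("-")

print(published)     # 2012-05-22
print(keywords[:4])  # ['the', 'anatomy', 'of', 'a']
```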
There is a lot of flexibility in the path component of the URL, but the path ends — and this is an always-always-ALWAYS statement — when a question mark appears in the URL. A “?” in the URL is a demarcation that denotes the end of the path and the beginning of…
Component 4: The Parameters
Not all URLs include parameters. And, for web analytics campaign tracking purposes, parameters often get added to URLs for pages that were developed without giving parameters a second thought. That’s what makes them fun!
Parameters are nothing more than a list of variables in the URL. There is no limit (well, there are overall URL length limits, but let’s not go there) to the number of parameters that can be included in a URL. But, there are a few hard-and-fast rules about parameters:
- They must be separated from the URL’s path using a “?”
- They must be separated from each other (when there are multiple parameters involved) using a “&” (this “must” is a little squishy — you can put subparameters inside of a single parameter using a little developer legerdemain…but that, too, is beyond the scope of this post)
- They must be structured as a “key-value pair.” The “key” is the name of the variable, while the “value” is the actual, well, value of the variable. The key goes on the left side of an “=” sign, and the value goes on the right side.
Key-value pairs are pretty simple to understand. You see them all the time as you browse the internet. Just look for “=” signs in URLs. All that the Google Analytics URL Builder for campaign tracking does is tack a series of key-value pairs on to the end of a protocol + hostname + path URL that you provide.
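Here is a sketch of that “tack on some key-value pairs” step in Python. To be clear, this is not how Google’s tool is implemented; `add_parameters` is a hypothetical helper of my own, and the landing page URL and campaign values are made up.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def add_parameters(url, extra):
    """Append key-value pairs to a URL, keeping any parameters it already has."""
    parts = urlsplit(url)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    pairs.extend(extra.items())
    return urlunsplit(parts._replace(query=urlencode(pairs)))

tagged = add_parameters(
    "http://www.example.com/landing-page/",
    {"utm_source": "newsletter", "utm_medium": "email", "utm_campaign": "spring_sale"},
)
print(tagged)
# http://www.example.com/landing-page/?utm_source=newsletter&utm_medium=email&utm_campaign=spring_sale
```

The helper takes care of the “?” and the “&”s, and `urlencode` escapes any characters in the values that aren’t safe to put in a URL.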
The order of parameters almost never matters!
Let’s say I had a URL that looked like this:
We have two parameters in this URL: “source” and “content.”
This URL would generally produce the exact same resulting content for the visitor:
All I did was change the order of the parameters. And, since they’re just a list of variables, sites typically won’t care about the order one whit.
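You can check the “order doesn’t matter” claim by parsing two orderings and comparing the results. In this Python sketch, only the parameter names “source” and “content” come from the example above; the hostname, path, and values are invented.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical URLs that differ only in the order of their parameters.
url_a = "http://www.example.com/page.htm?source=email&content=banner"
url_b = "http://www.example.com/page.htm?content=banner&source=email"

params_a = parse_qs(urlsplit(url_a).query)
params_b = parse_qs(urlsplit(url_b).query)

print(params_a)              # {'source': ['email'], 'content': ['banner']}
print(params_a == params_b)  # True
```

Once the parameters are parsed into a lookup by key, the order they arrived in is gone, which is roughly why most sites don’t care about it.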
Also (and I alluded to this earlier), you can generally add parameters to a URL without affecting the functionality of the page or what content gets displayed.
Let me repeat that, because it’s one of the keys to how web analytics tools capture traffic source data:
You can generally add parameters to a URL without affecting the functionality of the page or what content gets displayed.
When you add campaign tracking to a URL, you are doing something that the original developer of the content to which you are linking likely did not give a single thought. Try it on this page if you want to. Make up a key-value pair or two and tack them on the end of the URL for this page and see if the content changes. It won’t. Depending on what you tacked on, you’re probably introducing some squirrely data into my web analytics tools…but that’s okay. I’ll survive.
Parameters get used for lots of things:
- For web analytics campaign tracking
- To customize and personalize content that is presented to a visitor
- To drastically update the content shown on a page by using a parameter value to give the key piece of information as to what content/products/information should be displayed (this used to be much more prevalent, but it tends to have undesired SEO ramifications)
A single URL can include parameters that get used for many different purposes. As I noted, the order doesn’t matter. And, as I implied, most sites simply ignore parameters that they don’t recognize.
One caveat: occasionally, I come across a site where a developer took a shortcut in the implementation of the site such that unrecognized parameters do break the page. To date, I have never tracked down any of the handful of developers who have done this, so my desire to flog them has gone unfulfilled. “Extraneous” parameters should never break a site.
One more note: web analytics packages handle parameters in different ways:
- Sitecatalyst — since Sitecatalyst relies on pageNames rather than URLs, extra parameters don’t cause any web analytics issues
- Webtrends — historically (this might have changed), Webtrends stripped off all parameters in URLs by default and just used the hostname and path to identify pages; usually, this works fine, but there can be cases where you find you need the parameter to distinguish between different unique pages, and Webtrends has the ability to add those parameters back in through the configuration of the profile
- Google Analytics — by default, the only parameters that Google Analytics strips off of URLs are the Google Analytics campaign tracking parameters (utm_medium, utm_source, utm_campaign, etc.). But, you can go in and tell the tool to strip other parameters off as well.
Managing parameters effectively in your web analytics platform is one of those things that keeps your reports cleaner. If your site has, say, 300 basic pages, but your web analytics Pages report is maxing out with tens of thousands of rows, the chances are that you have a parameter management issue.
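If you ever need to do that cleanup yourself on exported data, the idea looks something like this Python sketch. It is not how any of the tools above work internally; `page_key` is a hypothetical helper, and the exclusion list (“sessionid” in particular) is an assumption that would differ from site to site.

```python
from urllib.parse import urlsplit, parse_qsl, urlencode

# Hypothetical exclusion list; "sessionid" is a made-up parameter name.
EXCLUDED = {"utm_source", "utm_medium", "utm_campaign", "sessionid"}

def page_key(url, excluded=EXCLUDED):
    """Return the path plus only the parameters that distinguish one page from another."""
    parts = urlsplit(url)
    kept = [(key, value)
            for key, value in parse_qsl(parts.query, keep_blank_values=True)
            if key not in excluded]
    kept.sort()  # parameter order should not create a separate row
    query = urlencode(kept)
    return parts.path + ("?" + query if query else "")

print(page_key("http://www.example.com/products.php?id=42&utm_source=newsletter&sessionid=abc123"))
# /products.php?id=42
```

Rolling URLs up by a key like this collapses the tracking-parameter variations of a page into a single row, while keeping parameters (like “id” here) that really do identify different content.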
Bonus Component: #
I don’t know that I would consider the hash sign (or “fragment identifier”) as a core component of the URL, but it’s worth a mention. Hash signs — #s — at the end of URLs refer to locations within the main page. Most commonly, these get used as intra-page “bookmarks” of sorts. Both Wikipedia and FAQ pages tend to use these quite a bit. For instance, if you view the source of this page, you will see the following in the HTML right at the beginning of this section:
And, if you tack “#bonus_component” onto the end of the URL for this page, the page will load and jump right down to this section.
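For completeness, here is how the fragment separates from the rest of the URL, sketched in Python. The hostname and path are invented; only “#bonus_component” comes from this post.

```python
from urllib.parse import urldefrag

# Hypothetical URL ending in the fragment mentioned above.
url = "http://www.example.com/some-post/#bonus_component"

page, fragment = urldefrag(url)
print(page)      # http://www.example.com/some-post/
print(fragment)  # bonus_component
```

One thing worth knowing as an analyst: browsers handle the fragment themselves and don’t send it to the web server as part of the request, so it won’t show up in your web server logs.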
Pretty Simple, Right?
I hope you found this helpful. URLs are key to the workings of the internet, and understanding their component parts and how you can both decipher them and manipulate them is one of those things that comes in handy when you least expect it!