Content Taxonomy Musings

My day job is running a bunch of systems as part of Microsoft’s evangelism group, and the biggest of those is Channel 9. Channel 9 is basically a video blogging platform. We have ‘blogs’, which contain posts, written by people. Seems simple enough, but after running for 10 years we have a bit of a problem. Too much content.

For years now, the problem of ‘how should we categorize our content?’ has been floating around as a top issue, but no one ever wants to put it above any of the features they’d like added to the site. It is still an issue though, and something that is at the top of my mind nearly every day. This is not your typical technical blogger post though, where I describe a problem and our brilliant solution. This is truly just my thoughts on the topic, and hopefully it will spur some good discussion. If that discussion helps me get to a solution, that wouldn’t be so bad either!

The situation

Our current categorization system is based on three main attributes that are added to posts:

  • They live in a container of some sort (a blog, a show, an event). A post has to exist in one and only one container
  • They have one or more contributors (the people who made the video, posted the video or appeared in the video… the fact that it could be any of these is a problem for another post)
  • They have zero or more tags

While we could dig into the other two attributes, the key organizing factor in our system is really these tags. Tags are applied by the content creator, and are visible as a means of navigation for the users. Since there is no fixed set, no structure, and no enforced requirements in our tagging, I would classify our tagging system as a folksonomy (more specifically a narrow one, if you want to get into specifics). Essentially, every content creator is responsible for deciding what the right set of tags is for a piece of content. This gives great freedom, but produces terrible inconsistency.

At the time of writing, we have:

  • 60 thousand videos up on the site,
  • 10 thousand tags, and
  • 13 thousand videos that have no tags at all

We have over 1,000 new videos being published each month; the problem isn’t getting smaller over time. So what to do?

The first red flag was the fact that 20% of our videos had zero tags, so we now require that you tag the video with something before you can publish it. That helps, but it doesn’t fix the problem. Let’s look at some really good pieces of content on the site and see how they are tagged.

This is a great 10-minute video by Scott Hanselman, covering Azure Media Indexer. This feature of Azure, part of Azure Media Services, will produce captions automatically from your video file. Very cool, everyone who cares about media should watch this.

Its current tags? Just one: “Azure”. That’s not terrible, it is about Azure after all, but is it sufficient?

The video is about Azure Media Indexer, a Microsoft service, so that would be a good addition. Azure Media Indexer is part of Azure Media Services, so Azure Media Services would be good to add as well. The point of the service is to make captions, which serve two main purposes: Accessibility and Find-ability (not a word, but it seems to fit). So, perhaps Accessibility would be a good addition, and maybe SEO? If you watch the video, Adarsh and Scott talk about how the output of Azure Media Indexer could be fed into a search engine like Azure Search. Should we add that? At this point, it starts to be a bit debatable. You could say “Yes, they mention Azure Search, so it should be included” or “No, it isn’t really about it, it is just mentioned in passing… doesn’t count”. It is a judgment call, which means we probably have gone deep enough with “Azure, Azure Media Indexer, Azure Media Services, Accessibility, SEO”.

That covers ‘what’ the video is about and even the ‘why’, but there is more going on here from the point of view of categorization. There is the ‘who’ (Scott Hanselman and Adarsh Solanki), the ‘how’ (it is an ‘Interview’), and just to be complete we could even think about the ‘where’ (it is ‘In Studio’).

If we were posting one video, we could go really detailed with our tagging, but there is a cost to every additional tag we add. The more data we attach to the post, the less meaningful it is, because we are adding attributes that are not the most important aspects of the post. Also, adding these tags takes time, and given the 20% of videos that were posted with zero tags, we can conclude that our contributors do not want to spend a long time categorizing their videos. If we focus in on the topic of the video for now, and ignore the other details (who, how, where), we would be ok with “Azure, Azure Media Indexer, Azure Media Services, Accessibility, SEO”. I could edit this video now and add these tags, but we are trying to fix the system now for the next 60 thousand videos, not just this one, so more thought is required. If it is already possible to add these tags to a video, why wasn’t it already done? It seems like it isn’t a technical issue, but what can we do in our system to get better results?

I have a few ideas, either to make it quicker and easier to tag content, or to provide more structure in our tags to encourage better content attribution. Free flowing thoughts below…

Remember Past Tagging

When someone sat down to post this episode of Azure Friday, it was probably not their first post on Channel 9, and we know it wasn’t the first post in this show. What if we did some simple queries to find the most commonly used tags by the user and/or in the show they are posting to and ‘auto tagged’ the post? I have some fear that we’d suggest the wrong tags (if the last 10 posts were about Azure Websites, we’d suggest that for this video… and it could end up tagged wrong), so it would have to either be just a suggestion (more work for the user, they have to ‘pick’ the tags to include, still easier than a blank input box though) or be really easy to remove the auto-suggested tags.
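Here is a rough sketch of what that ‘simple query’ could look like. It is written in Python for brevity rather than in the site’s own code, and the post fields (`author`, `show`, `tags`) and the sample data are made up for illustration. It counts the tags on earlier posts by the same person or in the same show, and returns the most common ones as suggestions only.

```python
from collections import Counter

def suggest_tags(past_posts, author, show, limit=5):
    """Suggest the tags used most often by this author or in this show."""
    counts = Counter()
    for post in past_posts:
        if post["author"] == author or post["show"] == show:
            counts.update(post["tags"])
    # most_common orders by frequency, highest first
    return [tag for tag, _ in counts.most_common(limit)]

# Hypothetical history, not real Channel 9 data
past_posts = [
    {"author": "scott", "show": "Azure Friday", "tags": ["Azure", "Azure Websites"]},
    {"author": "scott", "show": "Azure Friday", "tags": ["Azure", "Azure Websites"]},
    {"author": "other", "show": "Azure Friday", "tags": ["Azure", "Azure Media Services"]},
    {"author": "someone", "show": "Another Show", "tags": ["Windows"]},
]

suggestions = suggest_tags(past_posts, author="scott", show="Azure Friday", limit=2)
print(suggestions)
```

With this history the suggestions come out as Azure and Azure Websites, which is exactly the ‘wrong tag’ risk described above if the new video is about something else, so the result should be offered for the contributor to accept or remove rather than applied silently.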

Categorizing our tags, and then requiring tags in each category

When we dig through our thousands of tags, and when we look at the tagging in other systems in our family (like MVA), we see certain fundamental buckets that tags could be grouped in:

  • Product / Technology (ASP.NET, Azure, Windows, HyperV, etc.)
  • Scenario (Security, Web Development, Deployment, System Management, Testing and so on)
  • Audience (IT Pro, Developer, End User, Student, etc.)

If we moved from a single ‘Tags:’ field in our admin to a set of three (‘Products/Technologies:’, ‘Scenarios:’, ‘Audience:’), we could then add rules like ‘you must tag your content with at least one product or technology’. My fear with any required number of tags though is that people will just assign something randomly to get past the requirement, especially for a post that isn’t really about any given product. Consider this post, Countdown to Microsoft Ignite. What product is this about? Nothing really, it is about an event, it is useful, we are happy to have it on the site… but not about a product. Is it about any ‘Scenario’? Nope, not really. It does target one or more audiences, but that’s it.

So if we require a product tag, this will either end up tagged with something general like ‘Windows Server’, which corrupts our video catalog, or we add a ‘Product’ of ‘Ignite’, which is corrupting our tags. We could at least make suggestions though… “You haven’t tagged this video with any products or technologies. Are you sure it isn’t about any specific ones?”
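A minimal sketch of the ‘suggest, don’t require’ approach, in Python and under some assumptions: the tag-to-category mapping and the prompt wording for scenarios and audiences are invented here, and a real version would load the categories from wherever the tags are stored. The function returns prompts for empty categories instead of blocking the publish.

```python
# Hypothetical mapping of tags to the three buckets described above
TAG_CATEGORIES = {
    "Azure": "product",
    "ASP.NET": "product",
    "Security": "scenario",
    "Deployment": "scenario",
    "IT Pro": "audience",
    "Developer": "audience",
}

PROMPTS = {
    "product": "You haven't tagged this video with any products or technologies. "
               "Are you sure it isn't about any specific ones?",
    "scenario": "You haven't tagged this video with any scenarios. Does one apply?",
    "audience": "You haven't tagged this video with an audience. Who is it for?",
}

def missing_category_prompts(tags):
    """Return a prompt for each category that has no tag on the post."""
    used = {TAG_CATEGORIES[tag] for tag in tags if tag in TAG_CATEGORIES}
    return [PROMPTS[cat] for cat in ("product", "scenario", "audience") if cat not in used]

# A post like the Ignite countdown: an audience, but no product or scenario
for prompt in missing_category_prompts(["IT Pro"]):
    print(prompt)
```

Because these are only prompts, the contributor can dismiss them for a post that really has no product, which avoids pushing people toward a random tag just to get past a rule.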

Honestly, I feel this is like trying to decide on the best way to catalog your Blu-Ray collection (assuming you still have one); do you go alphabetical, by genre, by year released, or maybe in order of your personal rating? Or do you just toss them in a drawer and when you want to watch something just end up using Netflix anyway?

What about a hierarchy?

The idea of grouping tags into buckets like Product wasn’t a bad one, and I think it would help, but what if we also added the concept of parent-child relationships to our tags? What if Azure Media Indexer was a child of Azure Media Services, which was a child of Azure?

[Image: three-level hierarchy of tags, Azure to Azure Media Services to Azure Media Indexer]

Perhaps, if we had this relationship, we could use this knowledge to make the tagging job easier. Post is tagged only with Azure Media Indexer? We infer the rest of the tags all the way up the hierarchy. This has benefits when finding content; when you view ‘all posts tagged Azure’, we could return posts that are tagged with Azure Media Indexer too. The other benefit is for the content contributor; they could just tag with the very specific tag and wouldn’t have to remember to add all the others. That has merit, but it wouldn’t have stopped someone from tagging a post with just Azure like they did with the Azure Friday video given as an earlier example.
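To make the inference concrete, here is a small Python sketch. It assumes each tag has at most one parent, stored in a simple lookup table (the `PARENT` table and the post records are illustrative, not how the site stores tags). Walking up the table gives the inferred tags, and a listing for a broad tag can then include posts that carry only a specific one.

```python
# Hypothetical child -> parent table for the example hierarchy
PARENT = {
    "Azure Media Indexer": "Azure Media Services",
    "Azure Media Services": "Azure",
}

def with_ancestors(tags):
    """Return the given tags plus every tag above them in the hierarchy."""
    expanded = set(tags)
    for tag in tags:
        parent = PARENT.get(tag)
        # Stop at the root, or at a tag whose ancestors are handled elsewhere
        # (this also keeps an accidental cycle from looping forever)
        while parent is not None and parent not in expanded:
            expanded.add(parent)
            parent = PARENT.get(parent)
    return expanded

def posts_tagged(posts, tag):
    """Find posts tagged with `tag` directly or through a more specific child tag."""
    return [post for post in posts if tag in with_ancestors(post["tags"])]

posts = [
    {"title": "Captions from your video file", "tags": ["Azure Media Indexer"]},
    {"title": "An unrelated post", "tags": ["Windows"]},
]

print(sorted(with_ancestors(["Azure Media Indexer"])))
print([post["title"] for post in posts_tagged(posts, "Azure")])
```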

We could look at the tags used and ‘notice’ (I like putting human style verbs to what will turn into a series of Linq statements in code) that you are using a tag that has children, but that you are using none of the children. “You’ve tagged this video with ‘Azure’, which is a general tag that has 50 more specific child tags such as … do any of these also apply to your video?” Not sure if it would be sufficient to just check for tags that happen to have children, maybe we’d have to flag tags in the system as being very broad.
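The ‘notice’ step might look something like the sketch below, again in Python rather than LINQ. The `CHILDREN` table is a tiny invented sample, so the count in the message is 4 rather than 50. It flags any tag on the post that has more specific tags beneath it when none of those were used.

```python
# Hypothetical parent -> children table
CHILDREN = {
    "Azure": ["Azure Media Services", "Azure Websites", "Azure Search"],
    "Azure Media Services": ["Azure Media Indexer"],
}

def descendants(tag):
    """Collect every tag below `tag` in the hierarchy."""
    found = []
    stack = list(CHILDREN.get(tag, []))
    while stack:
        child = stack.pop()
        found.append(child)
        stack.extend(CHILDREN.get(child, []))
    return found

def broad_tag_prompts(tags):
    """Prompt for each tag that has child tags when none of them were used."""
    chosen = set(tags)
    prompts = []
    for tag in tags:
        more_specific = descendants(tag)
        if more_specific and not chosen.intersection(more_specific):
            prompts.append(
                f"You've tagged this video with '{tag}', which is a general tag that has "
                f"{len(more_specific)} more specific child tags. Do any of these also apply?"
            )
    return prompts

print(broad_tag_prompts(["Azure"]))
print(broad_tag_prompts(["Azure", "Azure Media Indexer"]))
```

This version only checks whether a tag happens to have children; an explicit ‘very broad’ flag on the tag would be a small extra condition in the same loop.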

If we are creating parent-child relationships, we could take this another step and create ‘aliases’ as well, similar to what Stack Overflow has done with their tag synonyms. This would help avoid the continual creation of duplicate tags in our system (such as Microsoft Research, MS Research, MSR, MSResearch), and increase the value of each tag.
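As a sketch of how aliases could work (in Python, with an invented `ALIASES` table; this is not a description of how Stack Overflow implements synonyms), each tag is mapped to its canonical form as it is entered, and duplicates that collapse to the same tag are dropped.

```python
# Hypothetical alias table: lower-cased alias -> canonical tag
ALIASES = {
    "ms research": "Microsoft Research",
    "msr": "Microsoft Research",
    "msresearch": "Microsoft Research",
}

def canonical_tag(raw):
    """Map an entered tag to its canonical form, ignoring case and extra spaces."""
    cleaned = " ".join(raw.split())
    return ALIASES.get(cleaned.lower(), cleaned)

def canonical_tags(raw_tags):
    """Canonicalize a list of tags, keeping the first occurrence of each."""
    result = []
    for raw in raw_tags:
        tag = canonical_tag(raw)
        if tag not in result:
            result.append(tag)
    return result

print(canonical_tags(["MSR", "MS Research", "Azure", "Microsoft Research"]))
```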


Auto-tagging

I hesitate to even add this, because I don’t know exactly how we would go about building it, but I know it will come up. It seems the major issue in our system is getting the contributors to put the right tags on their content, so what if we just did it for them? In a way, we do this already through our search; you can search on Azure Media Indexer and you will find the video with Adarsh and Scott even though it wasn’t tagged right. It had the right terms in the title and description, which is probably the first place we would look to find terms for auto-tagging. Still, tags are useful for establishing a navigation structure in the site in a way that search doesn’t provide, so extracting key terms would be useful. Understanding what terms are important, which ones are ‘Products’ for example, would be a challenge. Seems like the sort of project Microsoft Research could help with, but might be a bit daunting for our tiny 2 to 3-person dev team on Channel 9.
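The simplest possible version of this, far short of anything Microsoft Research would build, is to look for tags we already know about in the title and description. The Python sketch below does only that; the list of known tags and the sample title and description are made up for illustration.

```python
import re

# Hypothetical list of tags that already exist in the system
KNOWN_TAGS = [
    "Azure",
    "Azure Media Indexer",
    "Azure Media Services",
    "Azure Search",
    "Accessibility",
]

def tags_mentioned(text, known_tags=KNOWN_TAGS):
    """Return the known tags that appear as whole phrases in the text."""
    found = []
    for tag in known_tags:
        # Lookarounds keep 'Azure' from matching inside a longer word
        pattern = r"(?<!\w)" + re.escape(tag) + r"(?!\w)"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append(tag)
    return found

# Invented title and description, loosely modeled on the example video
title = "Azure Media Indexer: automatic captions for your video"
description = "Part of Azure Media Services, the indexer produces captions from a video file."

print(tags_mentioned(title + " " + description))
```

Note what this misses: the sample text never uses the word ‘Accessibility’, so that tag is not suggested even though it fits. Matching known terms finds what is mentioned, not what the video is about, which is the hard part.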


Conclusions

I have none…

Categorizing tags and adding tag relationships seems like the right way to go, but it will be a lot of work to get this implemented and move all our tags into this model. Until that work finds its way to the top of the priority stack, it is difficult to say if this is the right solution, or even an improvement over our current situation.

Writing Titles for Channel 9 posts

Throughout the rest of the site, on Twitter and through Google/Bing… the title of your post is the first and most important description of your content that people will see. With that in mind, here are a few tips to make the most of that space.


You should include enough in your title that it is a clear description of your content, but length is a factor. When we are indexed on Google, the full text is pulled from the <title> element of the page, but only the first x characters are visible in search results. For that reason, including the entire name of your series/show as the first part of the title is really taking away from what is shown to the user on other sites.

Take this video for example

The title of the series is “Windows Azure Pack: Express Installation Walkthrough”, and this is shown on every page where that video is listed (as you can see in the images below).

[Image: series title in the page title]

In the title of the browser window

[Image: series titles on the post]

On the page… in fact, in this case, it is listed 3 times (in addition to being in the title of the tab in the browser)

And in listings elsewhere on the site

[Image: series name shows up in listings]

Starting the title with the name of the series, though, means that the actual specific content of this video is not visible in the title in Bing or Google. The two snippets below are how this entry shows up in these two search engines.

[Image: Bing search results]

[Image: Google search results]


You can find more info on the limits for visible title length here, but in the end I would focus on putting the right info in the title, and putting the most important parts closer to the front… ideally in the first 50-60 characters.
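If you want a quick feel for what survives the cut, a few lines of Python will do. This is only an approximation: the 60-character limit is taken from the guideline above, real search engines truncate by rendered width rather than a fixed count, and the episode topic used here is invented.

```python
def search_preview(title, visible_chars=60):
    """Roughly what a search result shows: the title cut to a fixed length."""
    if len(title) <= visible_chars:
        return title
    # Leave room for the ellipsis that marks the cut
    return title[:visible_chars - 1].rstrip() + "…"

series = "Windows Azure Pack: Express Installation Walkthrough"
topic = "Configuring the Portal"  # hypothetical episode topic

series_first = f"{series} - {topic}"
topic_first = f"{topic} - {series}"

print(search_preview(series_first))
print(search_preview(topic_first))
```

With the series name first, the part that says what this particular video covers is cut off almost entirely; with the topic first, it survives and only the tail of the series name is lost.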

What to say in the title

I touched on it a bit in the section above, but you should make sure your title accurately describes the content. That’s really the key, but keeping in mind the length restrictions described above, there are some tricks to make this work well.

First, make sure you know what the point of your video is. Is it to introduce 5 features of ASP.NET 5? Awesome, then you have a lot of room to work with.

5 Things About ASP.NET 5 that will blow your mind!

In the title above, we know it is about ASP.NET 5 … and people love a numbered list of items… and there is nothing wrong with a bit of style and personality to your title (or your content!). If you can be specific, you should be. For example, if it is 5 new features, I’d say features instead of things… but sometimes it isn’t possible.

Include the product or technology you are talking about. Yes, you probably mention it in the body of the post, and in the tags, but the title is extremely important for discovery. Search engines rank it highly and it is also what they show to users, so the decision of whether or not to click is going to depend highly on that title. On that note…

Don’t do ‘clickbait’

You can search around to find out more about this term, but I’ll sum it up: Clickbait headlines are essentially a tease, they give you no real information, but hint at an amazing or interesting story just hiding behind the click. This is not helpful to anyone, even if stats show that they work. Yes, the person clicks, but if the article isn’t content they actually want to read, they just jump away seconds later. You get a page view, but no video view, and you definitely haven’t helped our customers.

For example, you could say this:

“You won’t believe what Microsoft has added to Windows 10…”

Or you could say

“New features in Windows 10”

One implies it is shocking and exciting, the other just tells me what I’ll learn if I click it.

As mentioned earlier, more specifics would be good. Windows 10 is a big product, so maybe “New features for client applications in Windows 10” or “5 new features in Windows 10 for client applications”. A list of items, “5 hottest cars at the auto show”, is really common in both good and bad headlines online. This style of headline is so common in ‘clickbait’ that sometimes we associate it with that type of misleading content. In reality though, people love numbered lists of things, so as long as you deliver what you promise then I think you should go for it.

Making an app to do audio transcription

Back in the day, I used to write articles for MSDN, and I started up this column called “Coding 4 Fun” (the name lives on today along with some of the same spirit). The premise was that I would write about code I wrote for my own personal use. An app to sync photos from my computer to my mother’s so that she would get updates of the kid’s pictures. A remote control using my Pocket PC (yep, that was a while ago). You get the idea.

That mostly all stopped when I joined the Channel 9 team. I didn’t just stop writing articles, I mostly stopped writing code for any reason other than work. Perhaps that’s a bad thing, but either way … no more writing about code for a few years.

A month or so ago though, I was searching the internet for a solution to a problem for my wife Laura. She writes articles for ParentMap (a local parenting publication here in the Seattle area), and as part of that job she records interviews with people on a digital recorder. I noticed that after recording the interview, she would spend a long time transcribing the interview from the sound file into text. You can probably understand how time-consuming that is using something like Windows Media Player. Press play, type for a bit, use your mouse to move the slider back because you missed something, go back to typing, slide it back, pause it, play it… it took ages to turn the audio into text. My first thought was some sort of speech recognition, but:

  • since she would be quoting people it had to be exact, not close and
  • the results I experienced running a few of her recordings through a few apps ranged from hilarious to terrible.

So I downloaded a bunch of apps, free or trial versions, that seemed to be what I wanted… an app for you to control audio playback while you typed up the transcript. Some of the high-priced ones, meant for legal or medical transcription, looked perfect… but Laura is not getting paid doctor or lawyer money to write these articles, so paying those types of prices seemed very wrong. Instead I decided to write something myself. This is the result of maybe 30-45 minutes of work, and most of that was spent fiddling with different key bindings for controlling the playback.

The key bindings I settled on were designed to allow you to keep your hands on the keyboard during the entire transcription.

  • Tab pauses the audio, press it again to jump back 5 seconds (configurable) and resume playback
  • \ (which happens to be in a good spot on my keyboard and my wife’s … opposite side but same spot as Tab) just jumps back 5 seconds without pausing.
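The behavior of those two keys is easy to model. The sketch below is a Python illustration of the logic only, not the C# from the download, and the class and method names are invented. It tracks a playback position and a playing flag, and shows how Tab toggles between ‘pause’ and ‘jump back and resume’ while \ just rewinds.

```python
class TranscriptionPlayer:
    """Toy model of the playback controls described above (no real audio)."""

    def __init__(self, jump_back_seconds=5.0):
        self.position = 0.0          # seconds into the recording
        self.playing = True
        self.jump_back_seconds = jump_back_seconds  # configurable rewind amount

    def advance(self, seconds):
        """Simulate time passing; the position only moves while playing."""
        if self.playing:
            self.position += seconds

    def _jump_back(self):
        # Never rewind past the start of the recording
        self.position = max(0.0, self.position - self.jump_back_seconds)

    def press(self, key):
        if key == "Tab":
            if self.playing:
                self.playing = False   # first press: pause
            else:
                self._jump_back()      # second press: rewind, then resume
                self.playing = True
        elif key == "\\":
            self._jump_back()          # rewind without pausing

player = TranscriptionPlayer()
player.advance(12)      # listen for 12 seconds
player.press("Tab")     # pause to catch up on typing
player.press("Tab")     # resume from 5 seconds earlier
print(player.position, player.playing)
```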

It works really well for her purposes; well enough that I thought I should post it. So here you go, it is a Visual Studio 2012 package, written in C# and posted here in a zip.

If you just want to run it, I’m not offering free support but you can click on this link and follow the instructions from there:

Channel 9 has a Windows Phone 8 app coming…

And it is available in beta now, please check it out and let me know what you think.



So, here you go, if you have a Windows Phone 8 and have any interest in Channel 9 type content:

Just using the app generates telemetry that is useful, but please reply with any direct feedback you have to me and I’ll consolidate.

If you have a Channel 9 account, you can swipe over to the panel marked “More”, pick Settings and sign into your account. This will let you rate videos and add them to your Queue, but otherwise the app is fully functional without signing in.

This is a beta, and we know some features are still missing and stability still needs work, so don’t be surprised if you hit a few issues. And if it crashes, we get that data, so your pain is our gain!


Note that we’ve decided to have the wonderful folks at HiddenPineapple publish this Beta as its own app, which means that when we release a final version (which will be a 1st party app, after the code has been transitioned to my team) you won’t be auto-updated… sorry about that!



Nokia 920 Wedding Video (Windows Phone 8)

I’m only posting this because my son and I saw this commercial before a movie, loved it, and then for the life of me I couldn’t find it anywhere online. I was (seemingly logically) searching for “Windows Phone Wedding”, sometimes adding Nokia 920… but no luck.


The actual YouTube video is entitled “Switch to the Nokia Lumia 920 Windows Phone” so I would guess my inclusion of “Wedding” was what was throwing off my search results 🙂

Word 2000 VBA (Wrox) Source Code

A few weeks ago someone emailed Channel 9 looking for me, hoping to track down the source code for an old book of mine (Word 2000 VBA). At the time I couldn’t imagine that I’d be able to find it, but I found a set of archive CDs that I’ve made over the years and voila, there it was.

Anyway, in the interest of feeding the internet search engine monster, the source code (in a zip) is attached to this post for anyone who might be looking for it. Download it from here:


Books, Recommendations, and Amazon

I really like to read. I read almost every night and I have since I was a little kid, so I’ve gone through a lot of books. Lately though I’ve noticed a new trend in my reading; I’m reading books that no one recommended to me. Well, no person recommended them to me, I guess it is all Amazon at this point.

Normally, and this goes back to the first time I read the Lord of the Rings, the Belgariad, or the many books of Piers Anthony, someone would mention the book to me and I’d go find it at the book store or the library and I’d be set. For years, this is exactly what would happen, going through Frank Herbert, Anne McCaffrey, Terry Goodkind, etc. but recently I found myself buying two different book series that absolutely no one had ever talked to me about before.

The first was a set of two books (with hopefully more to come!) by Peter V. Brett: The Warded Man (aka The Painted Man, his first novel) and The Desert Spear. Both of these were excellent, the kind of books that I gave up many hours of sleep for because I couldn’t bear to stop reading. Shortly after, due to Amazon’s decision to show it to me I suppose, I read the debut novel by Patrick Rothfuss, The Name of the Wind, and then pre-ordered his follow-up the moment it was available. So, both series are by new authors, completely unknown to me, and between the two of them they are some of the best writing I have ever read. I’ve just finished reading The Name of the Wind for the second time as a matter of fact. That is not unusual for me (I don’t want to even try to guess at how many readings of Dune I’ve gone through), but it still says something that I felt compelled to go through Rothfuss’s books again so soon after my first reading.

I think this shift, from all my reading flowing from word-of-mouth suggestions to finding books through the algorithms of Amazon’s recommendation engine, is due to two things. One is that such a system never existed when I was a kid, your friends were the only source of “if you liked that book, then you’d probably like this book” suggestions around. The second, which is a bit sad, is that as far as I can tell almost none of my friends read. We discuss movies, TV shows and video games all the time, and many recommendations are made, but books almost never come up. I won’t go on and on about this, I probably already sound like an old man at this point, no need to take the leap into “the problem with kids these days…”

In a little bit I think I’ll make another post that goes through the various books I was a big fan of over the years. They are, for the most part, all pretty popular books in their genres so it won’t be a great list of new books to run out and get, but so many of them are excellent that they deserve to be discussed.

Responding to feedback on Oxite

Hey folks, many of you are familiar with the commotion that occurred around Oxite’s initial release. For various reasons, Oxite received a lot of attention from developers, bloggers and press… mostly because it is by Microsoft and it contained a lot of buzz-words that people care about (the two biggest being Open Source and CMS… and of course, the aforementioned ‘Microsoft’). This attention was a surprise to us, but it was mostly positive to start with so we were fairly happy. Even all the positive attention was a bit of an issue for us though, as people repeatedly compared Oxite to other products. Most of these products are much larger (SharePoint and WordPress for example) with much larger feature sets, and with years of development and maturity behind them. Overall this was a PR issue for our team; we had to explain to people that we weren’t trying to compete with SharePoint or WordPress (or Umbraco, BlogEngine.NET, Graffiti, etc.)… we were trying to show that it was possible to create standards-compliant semantic markup using Microsoft’s web technologies.

A little bit later, various people in the ALT.NET community noticed that this project was using ASP.NET MVC and claimed to be a good example of Test Driven Development. Both of those facts put this firmly in their realm of expertise. We had made mistakes in both areas though, both around architecture and messaging. What happened next was both impressive (great to see such a large reaction from a relatively informal association of people) and completely disheartening (as the development team receiving the feedback) at the same time. The vast majority of the criticism was completely accurate and most of that was presented in a very professional manner. We really appreciated feedback that tried hard to be useful without any overly dramatic statements, but we read all the feedback, regardless of how it was presented.

Processing all that feedback, trying to understand the overall issue and the individual specific items that people were concerned by, took us some time. Luckily we had the help of Rob Conery at that point, to come up with a professionally presented list of specific issues for us to look at. Using his refactoring of Oxite to learn from, combined with lots of great blog posts from Chad Myers, Ayende, Javier Lozano and others, we started work on rebuilding Oxite to reflect the feedback we had been given. We worked over email with Chad, Javier and others as we went, and arranged a couple of code reviews with Phil Haack and Eilon Lipton from the ASP.NET MVC team and with Jonathan Carter and Jason Olson on the DPE side of Microsoft. In short, we did a lot of the things that we should have done the first time around.

Some people believe we should have taken the site and the source down at that point and put it back up when we were ready for this next release, but I decided that we would do our refactoring right there in the public source repository on CodePlex. I understand the arguments either way, but decided that doing our work in public would show anyone who was following the project and trying to learn from it that big changes were afoot and that they needed to stay up to date with the source.

So now, here we are again, with an updated release, toned-down messaging and without any big PR push behind the project. You can read about Oxite’s architecture here, and you can follow the blogs of Erik, Sampy and Nathan for lots of useful information, and of course you can go to CodePlex if you want to check out the code yourself. I hope that you like what you see, that you understand that we are learning many of these concepts as we go, and that we are very open to feedback. We know there are still some issues with the code (Sampy touches on some of them in his post), and for many of those we have a plan for improvement. I also expect that there will be decisions we’ve made that some people will disagree with, but I’m ok with that… let us know what you would have done instead (no, we aren’t asking you to fix it, just let us know or point us at a better example) as we are always learning and considering new options.

Newly updated Oxite release available

Erik pushed out a new release of Oxite today, the first since January 5th. This release is an important one, because it reflects a great many changes made in response to internal and external feedback about our initial release.

From the release notes:

We made many improvements, some based on community feedback, and added new features in this release:

  • New Model, Services and Repositories
  • Dependency Injection (Routes, Controllers, Services, Repositories, etc)
  • ActionFilter Registry
  • Better test coverage
  • New validation class added
  • Improved background services architecture
  • Projects cleaned up and consolidated
  • Views cleaned up
  • No more *.cs or *.cs.designer for views in web project
  • Now works in a sub directory
  • New admin dashboard
  • New and update (from last version) SQL scripts included
  • Many other small features, improvements and bug fixes

A lot of work went into these changes; they were the primary focus of Erik, Sampy and Nathan for the past 4+ weeks. Unity was implemented to provide Dependency Injection, xUnit was used as the test runner to remove a dependency on the higher level SKUs of Visual Studio, and a great deal of work was put into restructuring our data layer to completely abstract the Linq 2 SQL code from our actual objects. Like most software projects, there is always more work that could be done, and we will be making changes and additions as we continue to use this code for our work projects and for our personal sites.

If you are looking for more about Oxite, the discussions, issues and wiki pages on CodePlex are a great source of information, and you can always post a comment right here and ask me your question.