Content Taxonomy Musings

My day job is running a bunch of systems as part of Microsoft’s evangelism group and the biggest of those is Channel 9 (https://channel9.msdn.com). Channel 9 is basically a video blogging platform. We have ‘blogs’, which contain posts, written by people. Seems simple enough, but after running for 10 years we have a bit of a problem. Too much content.

For years now, the problem of ‘how should we categorize our content?’ has been floating around as a top issue, but no one ever wants to put it above any of the features they’d like added to the site. It is still an issue though, and something that is at the top of my mind nearly every day. This is not your typical technical blogger post though, where I describe a problem and our brilliant solution. This is truly just my thoughts on the topic, and hopefully it will spur some good discussion. If that discussion helps me get to a solution, that wouldn’t be so bad either!

The situation

Our current categorization system is based on three main attributes that are added to posts:

  • They live in a container of some sort (a blog, a show, an event). A post has to exist in one and only one container
  • They have one or more contributors (the people who made the video, posted the video or appeared in the video… the fact that it could be any of these is a problem for another post)
  • They have zero or more tags

While we could dig into the other two attributes, the key organizing factor in our system is really these tags. Tags are applied by the content creator, and are visible as a means of navigation for the users. Since there is no fixed set, no structure, and no enforced requirements in our tagging, I would classify our tagging system as a folksonomy (more specifically a narrow one, if you want to get into specifics). Essentially, every content creator is responsible for deciding what the right set of tags is for a piece of content. This gives great freedom, but produces terrible inconsistency.

At the time of writing, we have:

  • 60 thousand videos up on the site,
  • 10 thousand tags, and
  • 13 thousand videos that have no tags at all

We have over 1,000 new videos being published each month; the problem isn’t getting smaller over time. So what to do?

The first red flag was the fact that 20% of our videos had zero tags, so we now require that you tag the video with something before you can publish it. That helps, but it doesn’t fix the problem. Let’s look at some really good pieces of content on the site and see how they are tagged.

This is a great 10-minute video by Scott Hanselman, covering Azure Media Indexer. This feature of Azure, part of Azure Media Services, will produce captions automatically from your video file. Very cool, everyone who cares about media should watch this.

Its current tags? Just one: “Azure”. That’s not terrible, it is about Azure after all, but is it sufficient?

The video is about Azure Media Indexer, a Microsoft service, so that would be a good addition. Azure Media Indexer is part of Azure Media Services, so Azure Media Services would be good to add as well. The point of the service is to make captions, which serve two main purposes: Accessibility and Find-ability (not a word, but it seems to fit). So, perhaps Accessibility would be a good addition, and maybe SEO? If you watch the video, Adarsh and Scott talk about how the output of Azure Media Indexer could be fed into a search engine like Azure Search, should we add that? At this point, it starts to be a bit debatable. You could say “yes, they mention Azure Search, so it should be included.” or “No, it isn’t really about it, it is just mentioned in passing… doesn’t count”. It is a judgment call, which means we probably have gone deep enough with “Azure, Azure Media Indexer, Azure Media Services, Accessibility, SEO”. That covers ‘what’ the video is about and even the ‘why’, but there is more going on here from the point of view of categorization. There is the ‘who’ (Scott Hanselman and Adarsh Solanki), the ‘how’ (it is an ‘Interview’), and just to be complete we could even think about the where (it is ‘In Studio’).

If we were posting one video, we could go really detailed with our tagging, but there is a cost to every additional tag we add. The more data we attach to the post, the less meaningful it is, because we are adding attributes that are not the most important aspects of the post. Also, adding these tags takes time, and given the 20% of videos that were posted with zero tags, we can conclude that our contributors do not want to spend a long time categorizing their videos. If we focus in on the topic of the video for now, and ignore the other details (who, how, where), we would be ok with “Azure, Azure Media Indexer, Azure Media Services, Accessibility, SEO”. I could edit this video now and add these tags, but we are trying to fix the system now for the next 60 thousand videos, not just this one, so more thought is required. If it is already possible to add these tags to a video, why wasn’t it already done? It seems like it isn’t a technical issue, but what can we do in our system to get better results?

I have a few ideas, either to make it quicker and easier to tag content, or to provide more structure in our tags to encourage better content attribution. Free flowing thoughts below…

Remember Past Tagging

When someone sat down to post this episode of Azure Friday, it was probably not their first post on Channel 9, and we know it wasn’t the first post in this show. What if we did some simple queries to find the most commonly used tags by the user and/or in the show they are posting to and ‘auto tagged’ the post? I have some fear that we’d suggest the wrong tags (if the last 10 posts were about Azure Websites, we’d suggest that for this video… and it could end up tagged wrong), so it would have to either be just a suggestion (more work for the user, they have to ‘pick’ the tags to include, still easier than a blank input box though) or be really easy to remove the auto-suggested tags.
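To make the 'simple queries' idea a bit more concrete, here is a minimal sketch in Python. The post records, the field names (author, show, tags) and the suggest_tags helper are all made up for illustration; our real data model and queries would look different.

```python
from collections import Counter

def suggest_tags(past_posts, author, show, limit=5):
    """Return the tags used most often by this author or in this show."""
    counts = Counter()
    for post in past_posts:
        # A post counts if it shares either the author or the show.
        if post["author"] == author or post["show"] == show:
            counts.update(post["tags"])
    return [tag for tag, _ in counts.most_common(limit)]

# Hypothetical history, not real Channel 9 data.
past_posts = [
    {"author": "scott", "show": "Azure Friday", "tags": ["Azure", "Azure Websites"]},
    {"author": "scott", "show": "Azure Friday", "tags": ["Azure", "Azure Websites"]},
    {"author": "scott", "show": "Azure Friday", "tags": ["Azure", "Azure Media Services"]},
    {"author": "someone", "show": "Other", "tags": ["Windows"]},
]

suggestions = suggest_tags(past_posts, "scott", "Azure Friday", limit=2)
print(suggestions)
```

With this history the top suggestions are 'Azure' and 'Azure Websites', which is exactly the risk described above: a run of Azure Websites posts would push that tag onto a video about something else, so the result should be offered as a suggestion rather than applied silently.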

Categorizing our tags, and then requiring tags in each category

When we dig through our thousands of tags, and when we look at the tagging in other systems in our family (like MVA), we see certain fundamental buckets that tags could be grouped in:

  • Product / Technology (ASP.NET, Azure, Windows, HyperV, etc.)
  • Scenario (Security, Web Development, Deployment, System Management, Testing and so on)
  • Audience (IT Pro, Developer, End User, Student, etc.)

If we moved from a single ‘Tags:’ field in our admin to a set of three (‘Products/Technologies:’, ‘Scenarios:’, ‘Audience:’), we could then add rules like ‘you must tag your content with at least one product or technology’. My fear with any required number of tags though is that people will just assign something randomly to get past the requirement, especially for a post that isn’t really about any given product. Consider this post, Countdown to Microsoft Ignite. What product is this about? Nothing really, it is about an event, it is useful, we are happy to have it on the site… but not about a product. Is it about any ‘Scenario’? Nope, not really. It does target one or more audiences, but that’s it.

So if we require a product tag, this will either end up tagged with something general like ‘Windows Server’, which corrupts our video catalog, or we add a ‘Product’ of ‘Ignite’, which is corrupting our tags. We could at least make suggestions though… “You haven’t tagged this video with any products or technologies. Are you sure it isn’t about any specific ones?”
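As a rough sketch of the 'suggest, don't require' approach, the check below returns warnings instead of blocking the publish. The category contents are just the examples listed earlier, and CATEGORIES and missing_category_warnings are hypothetical names, not anything in our admin today.

```python
# Example buckets only; a real system would load these from the tag store.
CATEGORIES = {
    "Product / Technology": {"ASP.NET", "Azure", "Windows", "HyperV"},
    "Scenario": {"Security", "Web Development", "Deployment",
                 "System Management", "Testing"},
    "Audience": {"IT Pro", "Developer", "End User", "Student"},
}

def missing_category_warnings(tags):
    """Return one soft warning per category that has no tag on the post."""
    warnings = []
    for category, known in CATEGORIES.items():
        if not known.intersection(tags):
            warnings.append(
                f"You haven't tagged this video with any {category} tags. "
                "Are you sure none apply?"
            )
    return warnings

# An event announcement that only targets an audience gets two nudges,
# but nothing stops it from being published.
for warning in missing_category_warnings(["IT Pro"]):
    print(warning)
```

Because the function only reports, a post like the Ignite countdown can go out with just an audience tag, and nobody is pushed into picking a random product to satisfy a rule.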

Honestly, I feel this is like trying to decide on the best way to catalog your Blu-Ray collection (assuming you still have one); do you go alphabetical, by genre, by year released, or maybe in order of your personal rating? Or do you just toss them in a drawer and when you want to watch something just end up using Netflix anyway?

What about a hierarchy?

The idea of grouping tags into buckets like Product wasn’t a bad one, and I think it would help, but what if we also added the concept of parent-child relationships to our tags? What if Azure Media Indexer was a child of Azure Media Services, which was a child of Azure?

3-level hierarchy of tags, Azure to Azure Media Services to Azure Media Indexer

Perhaps, if we had this relationship, we could use this knowledge to make the tagging job easier. Post is tagged only with Azure Media Indexer? We infer the rest of the tags all the way up the hierarchy. This has benefits when finding content; when you view ‘all posts tagged Azure’, we could return posts that are tagged with Azure Media Indexer too. The other benefit is for the content contributor; they could just tag with the very specific tag and wouldn’t have to remember to add all the others. That has merit, but it wouldn’t have stopped someone from tagging a post with just Azure like they did with the Azure Friday video given as an earlier example.
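The inference step could be sketched like this, assuming each tag stores at most one parent. The PARENT table and both helper names are invented for the example; only the three Azure tags come from the hierarchy described above.

```python
# Child tag -> parent tag (single parent per tag is an assumption).
PARENT = {
    "Azure Media Indexer": "Azure Media Services",
    "Azure Media Services": "Azure",
}

def expand_with_ancestors(tags):
    """Return the given tags plus every ancestor up the hierarchy."""
    expanded = set(tags)
    for tag in tags:
        current = PARENT.get(tag)
        # Stop at the root, or as soon as we reach a tag we already have.
        while current is not None and current not in expanded:
            expanded.add(current)
            current = PARENT.get(current)
    return expanded

def posts_tagged(posts, tag):
    """Titles of posts that carry the tag directly or through a child tag."""
    return [title for title, tags in posts.items()
            if tag in expand_with_ancestors(tags)]

# Hypothetical posts keyed by title.
posts = {
    "Indexer video": ["Azure Media Indexer"],
    "Windows video": ["Windows"],
}
print(posts_tagged(posts, "Azure"))
```

Here the post tagged only with 'Azure Media Indexer' shows up under 'Azure' without the contributor having added the broader tags by hand.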

We could look at the tags used and ‘notice’ (I like putting human-style verbs to what will turn into a series of LINQ statements in code) that you are using a tag that has children, but that you are using none of the children. “You’ve tagged this video with ‘Azure’, which is a general tag that has 50 more specific child tags such as … do any of these also apply to your video?” Not sure if it would be sufficient to just check for tags that happen to have children, maybe we’d have to flag tags in the system as being very broad.
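Here is one way that 'notice' step might look, written in Python rather than LINQ just to keep it short. The CHILDREN table is a tiny made-up slice of a hierarchy (placing Azure Search and Azure Websites under Azure is my assumption), and the function names are hypothetical.

```python
# Parent tag -> direct children (illustrative subset, not our real tag tree).
CHILDREN = {
    "Azure": ["Azure Media Services", "Azure Search", "Azure Websites"],
    "Azure Media Services": ["Azure Media Indexer"],
}

def descendants(tag):
    """All tags below the given tag, nearest levels first."""
    found = []
    queue = list(CHILDREN.get(tag, []))
    while queue:
        child = queue.pop(0)
        if child not in found:
            found.append(child)
            queue.extend(CHILDREN.get(child, []))
    return found

def broad_tag_prompts(tags):
    """Prompt for each tag that has children when none of them are used."""
    prompts = []
    for tag in tags:
        more_specific = descendants(tag)
        if more_specific and not set(more_specific) & set(tags):
            examples = ", ".join(more_specific[:3])
            prompts.append(
                f"You've tagged this video with '{tag}', which has "
                f"{len(more_specific)} more specific child tags such as "
                f"{examples}. Do any of these also apply?"
            )
    return prompts

for prompt in broad_tag_prompts(["Azure"]):
    print(prompt)
```

This version treats any tag with descendants as 'broad', which is the simple check; an explicit 'very broad' flag on the tag would replace the `if more_specific` test.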

If we are creating parent-child relationships, we could take this another step and create ‘aliases’ as well, similar to what Stack Overflow has done with their tag synonyms. This would help avoid the continual creation of duplicate tags in our system (such as Microsoft Research, MS Research, MSR, MSResearch), and increase the value of each tag.
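A minimal sketch of alias handling, assuming a simple lookup table from lower-cased alias to the canonical tag. The ALIASES table and canonical_tags function are illustrative only and are not how Stack Overflow implements synonyms.

```python
# Lower-cased alias -> canonical tag (hypothetical table).
ALIASES = {
    "ms research": "Microsoft Research",
    "msr": "Microsoft Research",
    "msresearch": "Microsoft Research",
}

def canonical_tags(raw_tags):
    """Map each tag to its canonical form and drop duplicates, keeping order."""
    seen = set()
    result = []
    for raw in raw_tags:
        cleaned = raw.strip()
        canonical = ALIASES.get(cleaned.lower(), cleaned)
        key = canonical.lower()
        if key not in seen:
            seen.add(key)
            result.append(canonical)
    return result

print(canonical_tags(["MSR", "Azure", "MS Research", "Microsoft Research"]))
```

All three spellings collapse into a single 'Microsoft Research' tag, so a listing for that tag would pick up posts no matter which variant the contributor typed.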

Auto-Tagging?

I hesitate to even add this, because I don’t know exactly how we would go about building it, but I know it will come up. It seems the major issue in our system is getting the contributors to put the right tags on their content, so what if we just did it for them? In a way, we do this already through our search: you can search on Azure Media Indexer and you will find the video with Adarsh and Scott even though it wasn’t tagged right. It had the right terms in the title and description, which is probably the first place we would look to find terms for auto-tagging. Still, tags are useful for establishing a navigation structure in the site in a way that search doesn’t provide, so extracting key terms would be useful. Understanding what terms are important, which ones are ‘Products’ for example, would be a challenge. Seems like the sort of project Microsoft Research could help with, but might be a bit daunting for our tiny 2 to 3-person dev team on Channel 9.
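The most naive version of this, well short of anything a research team would build, is to look for tags we already know about in the title and description. The sketch below does only that; the KNOWN_TAGS list, the tags_mentioned function and the sample title and description are all invented for illustration.

```python
import re

# A handful of existing tags; the real list would be thousands long.
KNOWN_TAGS = [
    "Azure",
    "Azure Media Indexer",
    "Azure Media Services",
    "Azure Search",
    "Accessibility",
]

def tags_mentioned(title, description, known_tags=KNOWN_TAGS):
    """Return known tags that appear as whole phrases in the post text."""
    text = f"{title}\n{description}"
    found = []
    for tag in known_tags:
        # Match the tag only when it is not glued to other word characters.
        pattern = r"(?<!\w)" + re.escape(tag) + r"(?!\w)"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append(tag)
    return found

# Hypothetical title and description, not the real post text.
title = "Azure Media Indexer with Adarsh Solanki"
description = ("Automatically produce captions from your video file "
               "using Azure Media Services.")
print(tags_mentioned(title, description))
```

This only finds tags that already exist and are spelled out in the text, so it would never suggest 'Accessibility' for a captioning video unless the word appears; working out which terms matter is still the hard part.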

Conclusion?

I have none…

Categorizing tags and tag relationships seem like the right way to go, but it will be a lot of work to get this implemented and move all our tags into this model. Until that work finds its way to the top of the priority stack, it is difficult to say if this is the right solution, or even an improvement over our current situation.

Writing Titles for Channel 9 posts

Throughout the rest of the site, on Twitter and through Google/Bing … the title of your post is the first and most important description of your content that people will see. With that in mind, here are a few tips to make the most of that space.

Length

You should include enough in your title that it is a clear description of your content, but length is a factor. When we are indexed on Google, the full text is pulled from the <title> element of the page, but only roughly the first 50-60 characters are visible in search results. For that reason, including the entire name of your series/show as the first part of the title is really taking away from what is shown to the user on other sites.

Take this video for example

http://channel9.msdn.com/Series/Windows-Azure-Pack-Express-Installation-Walkthrough/02

The title of the series is “Windows Azure Pack: Express Installation Walkthrough”, and this is shown on every page where that video is listed (as you can see in the images below).

Series Title in the page title

In the title of the browser window

Series Titles on the post

On the page… in fact, in this case, it is listed 3 times (in addition to being in the title of the tab in the browser)

And in listings elsewhere on the site

Series name shows up in listings

Making the series name the start of the title, though, means that the actual specific content of this video is not visible in the title in Bing or Google. The two snippets below are how this entry shows up in these two search engines.

Bing search results

Bing

Google Search Results

Google

You can find more info on the limits for visible title length here, but in the end I would focus on putting the right info in the title, and putting the most important parts closer to the front… ideally in the first 50-60 characters.
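To see why the order matters, here is a small sketch that cuts a title at 60 characters the way a search result might. The flat 60-character limit is a simplifying assumption (search engines decide the cut-off by their own rules), and the episode title is made up since only the series name is quoted above.

```python
def visible_title(title, limit=60):
    """Approximate what a search result shows: the first `limit` characters."""
    if len(title) <= limit:
        return title
    return title[:limit].rstrip() + "..."

series = "Windows Azure Pack: Express Installation Walkthrough"
episode = "Installing the Database Server"  # hypothetical episode title

series_first = f"{series}: {episode}"
episode_first = f"{episode} | {series}"

print(visible_title(series_first))
print(visible_title(episode_first))
```

With the series name first, almost none of the episode title survives the cut; with the episode first, the specific topic is fully visible and it is the series name that gets trimmed.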

What to say in the title

I touched on it a bit in the section above, but you should make sure your title accurately describes the content. That’s really the key, but keeping in mind the length restrictions described above, there are some tricks to make this work well.

First, make sure you know what the point of your video is. Is it to introduce 5 features of ASP.NET 5? Awesome, then you have a lot of room to work with.

5 Things About ASP.NET 5 that will blow your mind!

In the title above, we know it is about ASP.NET 5 … and people love a numbered list of items… and there is nothing wrong with a bit of style and personality to your title (or your content!). If you can be specific, you should be. For example, if it is 5 new features, I’d say features instead of things… but sometimes it isn’t possible.

Include the product or technology you are talking about. Yes, you probably mention it in the body of the post, and in the tags, but the title is extremely important for discovery. Search engines rank it highly and it is also what they show to users, so the decision of whether or not to click is going to depend highly on that title. On that note…

Don’t do ‘clickbait’

You can search around to find out more about this term, but I’ll sum it up: Clickbait headlines are essentially a tease, they give you no real information, but hint at an amazing or interesting story just hiding behind the click. This is not helpful to anyone, even if stats show that they work. Yes, the person clicks, but if the article isn’t content they actually want to read, they just jump away seconds later. You get a page view, but no video view, and you definitely haven’t helped our customers.

For example, you could say this:

“You won’t believe what Microsoft has added to Windows 10…”

Or you could say

“New features in Windows 10”

One implies it is shocking and exciting, the other just tells me what I’ll learn if I click it.

As mentioned earlier, more specifics would be good. Windows 10 is a big product, so maybe “New features for client applications in Windows 10” or “5 new features in Windows 10 for client applications”. A list of items, “5 hottest cars at the auto show”, is really common in both good and bad headlines online. This style of headline is so common in ‘clickbait’ that sometimes we associate it with that type of misleading content. In reality though, people love numbered lists of things, so as long as you deliver what you promise then I think you should go for it.

Channel 9 has a Windows Phone 8 app coming…

And it is available in beta now, please check it out and let me know what you think.

 


So, here you go, if you have a Windows Phone 8 and have any interest in Channel 9 type content:

http://www.windowsphone.com/s?appid=9a5115e0-268a-4260-876c-dc0ff2f45c48

Just using the app generates telemetry that is useful, but please reply with any direct feedback you have to me and I’ll consolidate.

If you have a Channel 9 account, you can swipe over to the panel marked “More”, pick Settings and sign into your account. This will let you rate videos and add them to your Queue, but otherwise the app is fully functional without signing in.

This is a beta, and we know some features are still missing and stability isn’t where it should be yet, so don’t be surprised if you hit a few issues. And if it crashes, we get that data, so your pain is our gain!

 

Note that we’ve decided to have the wonderful folks at HiddenPineapple publish this beta as its own app, which means that when we release a final version (which will be a 1st party app, after the code has been transitioned to my team) you won’t be auto-updated… sorry about that!

 

Thanks!

Responding to feedback on Oxite

Hey folks, many of you are familiar with the commotion that occurred around Oxite’s initial release. For various reasons, Oxite received a lot of attention from developers, bloggers and press… mostly because it is by Microsoft and it contained a lot of buzz-words that people care about (the two biggest being Open Source and CMS… and of course, the aforementioned ‘Microsoft’). This attention was a surprise to us, but it was mostly positive to start with so we were fairly happy. Even all the positive attention was a bit of an issue for us though, as people repeatedly compared Oxite to other products. Most of these products are much larger (SharePoint and WordPress for example) with much larger feature sets, and with years of development and maturity behind them. Overall this was a PR issue for our team, we had to explain to people that we weren’t trying to compete with SharePoint or WordPress (or Umbraco, BlogEngine.NET, Graffiti, etc.) … we were trying to show that it was possible to create standards compliant semantic markup using Microsoft’s web technologies.

A little bit later, various people in the ALT.NET community noticed that this project was using ASP.NET MVC and claimed to be a good example of Test Driven Development. Both of those facts put this firmly in their realm of expertise. We had made mistakes in both areas though, both around architecture and messaging. What happened next was both impressive (great to see such a large reaction from a relatively informal association of people) and completely disheartening (as the development team receiving the feedback) at the same time. The vast majority of the criticism was completely accurate and most of that was presented in a very professional manner. We really appreciated feedback that tried hard to be useful without any overly dramatic statements, but we read all the feedback, regardless of how it was presented.

Processing all that feedback, trying to understand the overall issue and the individual specific items that people were concerned by, took us some time. Luckily we had the help of Rob Conery at that point, to come up with a professionally presented list of specific issues for us to look at. Using his refactoring of Oxite to learn from, combined with lots of great blog posts from Chad Myers, Ayende, Javier Lozano and others, we started work on rebuilding Oxite to reflect the feedback we had been given. We worked over email with Chad, Javier and others as we went, and arranged a couple of code reviews with Phil Haack and Eilon Lipton from the ASP.NET MVC team and with Jonathan Carter and Jason Olson on the DPE side of Microsoft. In short, we did a lot of the things that we should have done the first time around.

Some people believe we should have taken the site and the source down at that point and put it back up when we were ready for this next release, but I decided that we would do our refactoring right there in the public source repository on CodePlex. I understand the arguments either way, but decided that doing our work in public would show anyone who was following the project and trying to learn from it that big changes were afoot and that they needed to stay up to date with the source.

So now, here we are again, with an updated release, toned down messaging and without any big PR push behind the project. You can read about Oxite’s architecture here, and you can follow the blogs of Erik, Sampy and Nathan for lots of useful information, and of course you can go to CodePlex if you want to check out the code yourself. I hope that you like what you see, that you understand that we are learning many of these concepts as we go, and that we are very open to feedback. We know there are still some issues with the code, Sampy touches on some of them in his post, and for many of those we have a plan for improvement. I also expect that there will be decisions we’ve made that some people will disagree with, but I’m ok with that… let us know what you would have done instead (no, we aren’t asking you to fix it, just let us know or point us at a better example) as we are always learning and considering new options.

Newly updated Oxite release available

Erik pushed out a new release of Oxite today, the first since January 5th. This release is an important one, because it reflects a great many changes made in response to internal and external feedback about our initial release.

From the release notes:

We made many improvements, some based on community feedback, and added new features in this release:

  • New Model, Services and Repositories
  • Dependency Injection (Routes, Controllers, Services, Repositories, etc)
  • ActionFilter Registry
  • Better test coverage
  • New validation class added
  • Improved background services architecture
  • Projects cleaned up and consolidated
  • Views cleaned up
  • No more *.cs or *.cs.designer for views in web project
  • Now works in a sub directory
  • New admin dashboard
  • New and update (from last version) SQL scripts included
  • Many other small features, improvements and bug fixes

A lot of work went into these changes; they were the primary focus of Erik, Sampy and Nathan for the past 4+ weeks. Unity was implemented to provide Dependency Injection, xUnit was used as the test runner to remove a dependency on the higher-level SKUs of Visual Studio, and a great deal of work was put into restructuring our data layer to completely abstract the LINQ to SQL code from our actual objects. Like most software projects, there is always more work that could be done, and we will be making changes and additions as we continue to use this code for our work projects and for our personal sites.

If you are looking for more about Oxite, the discussions, issues and wiki pages on CodePlex are a great source of information, and you can always post a comment right here and ask me your question.

Chatting today amongst the EvNet team

Aug 29

10:20 AM

Duncan M.

source code formatting checked in

Duncan M.

http://localhost/posts/Sampy/PAX-Day-3-In-…

Aug 29

10:25 AM

Duncan M.

sourcecodeFF

Duncan M.

sourcecodeIE

Duncan M.

I started out with overflow-x:auto … which would add a scroll bar (at least in FF3), but then I went with white-space:pre-wrap; … but IE doesn’t like that 🙂

Aug 29

10:55 AM

Duncan M.

Nathan, can you send me the link to those pre-wrap alternates?

Aug 29

11:00 AM

nathan h.

http://users.tkk.fi/~tkarvine/pre-wrap-css…

Aug 29

11:05 AM

Duncan M.

thanks

Aug 29

2:35 PM

Duncan M.

I wonder if we should consider using this site to create/embed polls? http://www.polldaddy.com

Aug 29

2:45 PM

Erik P.

I’ve noticed a lot of people starting to use js includes and other stuff for polls, comments, ratings, etc.

Duncan M.

yep

Erik P.

For us, I think it’s a matter of integration. Is it something we care to tightly integrate into our system to do custom queries and views things like that or is it something we just want to throw in? For polls, not sure which way is better. What do you think?

Duncan M.

I’m interested in the http://disqus.com/ comment system as well … but not for our core sites

Duncan M.

For polls, I don’t think we’d want to do anything with the data that is user specific

Aug 29

2:50 PM

Duncan M.

I think they really just want to put it up on the site, gets lots of interaction (including non registered users) and then discuss the results

Aug 29

2:50 PM

Erik P.

Then I think those external things is a good idea. 🙂

Erik P.

is = are

Duncan M.

the ideal type of integration I could picture with something like polldaddy.com would be to associate a discussion with it somehow (like making it an entry) so that we could show the poll on the home page (sidebar?) and then have a ‘click to discuss’ option

Duncan M.

this could be manual even, just create a forum thread about the poll, embed the poll in that thread *and* on the home page, and then put a link below the poll on the home page to the thread

The new and improved Channel 9 has shipped!

When I joined my current team, it was called the Channel 9 dev team, because Channel 9 was the big site that they had built and was the center of all of their efforts. You certainly wouldn’t have known that from how we spent the last two years though 🙂

We built a whole new code base for a video blog site and launched a new site (Channel 10) on that code, bringing some of the video style of Channel 9 to a new audience. We often discussed, as we shipped out revision after revision of the Channel 10 home page, that our next goal would be to ship out a new version of Channel 9… all moved onto that new code. Things didn’t go according to plan though, and we shipped out VisitMix.com, Channel 8, and TechNet Edge (for IT Pros) before we were really given more than a moment’s peace to start planning out the work to deploy Channel 9 on a new code base.

When it did become time to plan the Channel 9 deployment, it turned out that, while the new code base was a major improvement in many ways, the feature gap between what was already available on 9 and what the new code could offer was substantial. Now at least a year and a half had gone by since I joined the Channel 9 dev team, and we could see a long road ahead of us to get the next version of Channel 9 shipped; all the while, new features, bug fixes and UI changes for the other channels kept coming up and mixed priorities from above meant that we still had to devote most of our time to those properties.

Finally, at the start of this year (2008), I was asked to start laying out a firm plan that would get Channel 9 v4 (as we took to calling it) shipped. As part of that, we gained a new focus on Channel 9 and were able to finally prioritize it above some of the day-to-day needs of our other properties. We weren’t completely focused on the task, but we finally had the ability to say no to most things that would pull us away from C9… progress began to be made.

At the start of May of this year we shipped out a beta version of Channel 9, and by the end of the month (June 2nd to be precise) we did the final switch over, made the DNS change that pointed channel9.msdn.com at a new set of web servers, and v4 was officially launched. Over the next week or so, we were bogged down dealing with a wide variety of bugs that didn’t turn up during our own testing or in the beta, but the site stabilized and is now running fairly well. We still have bugs, but the old site had bugs that had sat for months, so I think we are in pretty good shape.

You can see a picture gallery of the various Channel 9 home pages that were envisioned or deployed throughout the past few years, including the new one… and a similar gallery representing the many home pages of Channel 10 that we shipped over the course of just a few months.

 

version 3, C9 when I joined the team (shipped August 2005)

 

new, v4 (June 2, 2008)

 

All in all, I’m very happy with the site we shipped last week and very impressed with the team behind it all… it has been a busy couple of years, but it is nice to have tangible (if you can consider the web tangible) results that you can point at when you are thinking about what you’ve accomplished.

Sidebar Gadgets for Channel 9, Channel 8 and more

We recently had Donovan West (LiveGadgets.net) build us a set of sidebar gadgets for Windows Vista. These gadgets use the RSS feeds from each site and let you see all of our new content as it gets posted, then (using Silverlight) you can even play our videos right there on your desktop.

The Gadget in its open state

 

You can check out the gadgets (for Channel 9, Mix, C8, Edge, and on10.net) by clicking on the appropriate image below:

  • Channel 9 Gadget
  • Channel 10 Gadget for on10.net
  • TechNet Edge Gadget
  • Mix Online (visitmix.com) gadget
  • Channel 8 (channel8.msdn.com) gadget

Using my Xbox Live data service?

If you’ve written an app, private or public, using my data feed of Xbox Live info, I’d really appreciate it if you’d let me know. This isn’t a required ‘sign-up’, but I want to start keeping track so that I could possibly create a page listing all the sites using it, and it may also be useful to be able to contact folks if I need to make a change or take the service down for an hour or so. Just comment on this post, or drop me a line at duncanma@microsoft.com.