2009/08/30

Automatic generation of META tags for ASP.NET

Three well-known tags commonly used in SEO are the meta title, meta keywords and meta description tags:

<meta name="title" content="title goes here" /> 
<meta name="keywords" content="keywords, for, the, page, go, here"/> 
<meta name="description" content="Here you will find a textual description of the page" />

A lot has been written about the benefits of using them, and almost as much claiming that search engines no longer take them into account. Whether or not they are used to compute the position in the SERP (Search Engine Results Page), nobody disputes the benefits of having them correctly set on all your pages. At least the meta description tag is considered by Google in some way, since Google Webmaster Tools warns you about pages with duplicate meta descriptions:

Differentiate the descriptions for different pages. Using identical or similar descriptions on every page of a site isn't very helpful when individual pages appear in the web results. In these cases we're less likely to display the boilerplate text. Wherever possible, create descriptions that accurately describe the specific page. [...]

Download the VB project code

The question is not “should I use meta tags in my pages?”. The real questions (and here comes the problem) are “how can I create individual meta descriptions for all my pages?” and “how can I automate the creation of meta keywords?”. Doing it by hand would be too much work (or too technical) for you, or for your users if they create content on their own.

For instance, consider a CMS (Content Management System) in which users are prompted for some fields to create a new entry. In its simplest form, the CMS asks the user to enter a title and the content for the new entry. In advanced-user mode, the CMS could also ask the user to suggest some keywords, but the user will probably enter just two, three or four words (if any). The CMS needs a way to automatically guess and suggest a default set of meta keywords based on the content before the new entry is finally saved. Those keywords could be reviewed, possibly completed by the user, and then accepted. The meta title and meta description are much easier, but our code covers them as well.

In our sample VB project we will not suggest keywords for the user to confirm; we will just calculate them on the fly and set them without user intervention. We will use a dummy VirtualPathProvider that overrides the GetFile function in order to retrieve the virtualPath file from the real file system, so it is not a VirtualPathProvider in the full sense, just a wrapper that takes control of the files being served to ASP.NET before they are actually compiled. A VirtualPathProvider is commonly used to seamlessly map URL paths to databases or any other source of data other than the file system itself. Our custom class inheriting from VirtualPathProvider will be called FileWrapperPathProvider. It will not use the full potential of a VirtualPathProvider, since we will only retrieve the data from the file system, make minor changes to the source code on the fly and return the result to be compiled. This introduces a bit of overhead and some extra CPU cycles before the pages are compiled, but it only happens once, until the file needs to be compiled again (because the underlying file has changed, for instance).

Our FileWrapperPathProvider.GetFile function will return a FileWrapperVirtualFile whenever the requested virtualPath meets the conditions of the IsPathVirtual function: the file extension is .aspx or .aspx.vb and the path of the requested URL follows the scheme ~/xx/, that is to say, it is under a folder with a two-character name (for the language: ~/en/, ~/de/, ~/fr/, ~/es/, …). Otherwise, it will return a VirtualFile handled by the previously registered VirtualPathProvider, i.e. the file system itself, without any change.
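
As a rough illustration of the IsPathVirtual conditions just described, here is a minimal sketch. It is written in Python rather than the VB of the sample project, the function name is hypothetical, and it assumes that the two characters are letters and that files in subfolders below the language folder also qualify; the real project may decide differently.

```python
import re

# Hypothetical restatement of the conditions described in the post:
# a two-letter language folder directly under the application root (~/xx/)
# and a file name ending in .aspx or .aspx.vb.
_VIRTUAL_PATH = re.compile(r"^~/[a-z]{2}/.*\.aspx(\.vb)?$", re.IGNORECASE)


def is_path_virtual(virtual_path: str) -> bool:
    """Return True when the wrapper should serve this path itself."""
    return _VIRTUAL_PATH.match(virtual_path) is not None
```

Any path that fails the check would simply be passed on to the previously registered provider.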

We have chosen to use a VirtualPathProvider wrapper around the real file system just to show what kind of things can be done with that class. If your data is in a database instead of static files, you will probably be using your own VirtualPathProvider, which virtualizes the requested path and retrieves the file contents from the database instead of the file system. In either case, you can adapt the idea illustrated in this post to your scenario.

The idea is somewhat twisted or cumbersome:

  1. Parse the code-behind file for the page being requested (the .aspx.vb file) and, using regular expressions (regex), replace the base class so that the page no longer inherits from System.Web.UI.Page and inherits from System_Web_UI_ProxyPage instead (a custom class of our own). This proxy page class declares public MetaTitle, MetaDescription and MetaKeywords properties and links them to the underlying meta title, meta description and meta keywords tags declared inside the head tag of the masterpage. A page that inherits from our System_Web_UI_ProxyPage exposes those three properties, which can then be set easily. See System_Web_UI_ProxyPage.OnLoad in our sample project.
  2. Read and parse the .aspx file linked to that .aspx.vb file (the same name without the .vb) and call the JAGBarcelo.MetasGen.GuessMetasFromString method, which does the main work on the file contents. See the FileWrapperVirtualFile.Open function in the sample project.
  3. Besides changing the base class to our own, add some lines to create (or extend) the Page_Init method in that .aspx.vb file. Those few lines of code, added on the fly, set the three properties exposed by the System_Web_UI_ProxyPage class to the values we have just calculated.
  4. Return the Stream with the modified contents as the output of the VirtualFile.Open function, so that the ASP.NET engine compiles it, based on the underlying real file, with the calculated meta title, meta keywords and meta description. Note that this is done in memory; the actual file system is never written to. The real files (.aspx.vb and .aspx) are read and parsed, and the virtual contents are created on the fly and handed to ASP.NET. You need to be really careful, because you can run into compile-time errors in places that are hard to understand: the files on disk are only the base contents, not the actual contents being compiled.
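
The source rewriting in steps 1 and 3 can be sketched as follows. This is an illustration in Python, not the code of the sample project: the function names and the SetGeneratedMetas handler are hypothetical, the sketch always adds a separate Init handler instead of extending an existing Page_Init, and it assumes the three values are single-line strings.

```python
import re

PROXY_BASE = "System_Web_UI_ProxyPage"  # class name used in the post


def vb_literal(text: str) -> str:
    """Quote a single-line string as a VB literal (embedded quotes are doubled)."""
    return '"' + text.replace('"', '""') + '"'


def rewrite_code_behind(source: str, title: str, keywords: str, description: str) -> str:
    """Return the in-memory version of an .aspx.vb file; nothing is written to disk."""
    # Step 1: swap the base class.
    rewritten, swaps = re.subn(
        r"(?im)^([ \t]*Inherits[ \t]+)System\.Web\.UI\.Page\b",
        lambda m: m.group(1) + PROXY_BASE,
        source,
    )
    closings = list(re.finditer(r"(?im)^[ \t]*End[ \t]+Class\b", rewritten))
    if swaps != 1 or not closings:
        return source  # unexpected layout: leave the file untouched

    # Step 3: add an Init handler (hypothetical name) that sets the properties.
    handler = "\n".join([
        "    Private Sub SetGeneratedMetas(ByVal sender As Object, "
        "ByVal e As System.EventArgs) Handles Me.Init",
        "        Me.MetaTitle = " + vb_literal(title),
        "        Me.MetaKeywords = " + vb_literal(keywords),
        "        Me.MetaDescription = " + vb_literal(description),
        "    End Sub",
        "",
    ])
    cut = closings[-1].start()  # insert just before the last "End Class"
    return rewritten[:cut] + handler + rewritten[cut:]
```

Returning the original source when the pattern does not match exactly once is a deliberate choice here: a page with an unexpected layout is compiled as it is on disk, which keeps compile-time errors easier to trace.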

The way we calculate the metas in JAGBarcelo.MetasGen.GuessMetasFromString is:

  1. Select the proper set of noise words depending on the language of the text.
  2. Look for the content inside the whole text. It must be inside ContentPlaceHolders (we assume you are using masterpages), and we look for the particular ContentPlaceHolder that contains the main body of the page. Change the LookForThisContentPlaceHolder const inside the MetasGen.vb file to customise it for the ContentPlaceHolder names of your own masterpage.
  3. Calculate the meta title as the text within the first <h1> tags right after the searched ContentPlaceHolder.
  4. Iterate through the rest of the content, counting the occurrences of single words and of two-word phrases, discarding noise words for the given language.
  5. Calculate the keywords, creating a string that is filled with the most frequent single words (up to 190 characters) and then with the most frequent two-word phrases (up to 250 characters in total).
  6. Calculate the description, concatenating content previously parsed, to create a string of between 200 and 394 characters. Those two figures are not randomly chosen: in my experience, Google Webmaster Tools warns you when any of your pages has a meta description shorter than 200 or longer than 394 characters.
  7. Return the calculated title, keywords and description in the proper ByRef parameters.

A good thing about this approach of using a VirtualFile is that you can easily apply it to an existing website. No matter how many pages your site has (hundreds, thousands, ...), this code adds a meta title, meta keywords and a meta description to all your pages automatically and transparently, without user intervention and with very few modifications (if any) to your existing pages, and it scales well.

Counting word occurrences.

We iterate through the words within the text under consideration (the ContentPlaceHolder) and store their occurrences in a HashTable (ht1 for single words and ht2 for two-word phrases). All words are converted to lowercase. A word must have more than two characters to be taken into account and must not start with a digit. If it passes that quick test, it is checked against a noise-word list. If it is not a noise word, it is looked up in the proper HashTable and either added (ht1.Add(word, 1)) or, if it was already there, its count is incremented (ht1(word) = ht1(word) + 1).
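
The counting rules above can be sketched as follows, in Python instead of the project's VB and with Counter objects standing in for the ht1 and ht2 HashTables. The tiny noise-word set is a stand-in for the real lists, and the way two-word phrases are formed (two adjacent accepted words, with any rejected word breaking the pair) is an assumption, since the post does not spell it out.

```python
import re
from collections import Counter

# Tiny stand-in for the real noise-word lists (MostCommonWordsEN and friends).
NOISE_WORDS_EN = frozenset({"the", "and", "for", "are", "with", "that", "this"})


def is_candidate(word, noise_words):
    """Quick test first (length, leading digit), then the noise-word lookup."""
    return len(word) > 2 and not word[0].isdigit() and word not in noise_words


def count_occurrences(text, noise_words=NOISE_WORDS_EN):
    """Return (single-word counts, two-word phrase counts) for lowercased text."""
    singles, pairs = Counter(), Counter()
    previous = None
    for word in re.findall(r"\w+", text.lower()):
        if not is_candidate(word, noise_words):
            previous = None  # assumption: a rejected word breaks a phrase
            continue
        singles[word] += 1
        if previous is not None:
            pairs[previous + " " + word] += 1
        previous = word
    return singles, pairs
```

Note that this sketch ignores sentence boundaries, so the last word of one sentence can pair up with the first word of the next.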

Regarding the noise words, we first considered some of the word frequency lists available out there, but then we thought about using verb conjugations as well. So we first created MostCommonWordsEN, an array based on simple frequency lists, and then MostCommonVerbsEN, based on another frequency list that considered only verbs. Finally we created MostCommonConjugatedVerbsEN, where we stored all the conjugations of those most common English verbs. When checking a word against these lists we only use MostCommonWordsXX and MostCommonConjugatedVerbsXX (where XX is one of EN, ES, FR, DE, IT). Yes, we did the same for Spanish, French, German and Italian, whose conjugations are much more complex than the English -ed, -ing and -s endings. To generate all possible conjugations of the given verbs (in their infinitive form) automatically we used http://www.verbix.com/

Calculating meta title.

It will be the text enclosed between the first <h1> and </h1> heading tags right after the main ContentPlaceHolder of the page.
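
A minimal sketch of that extraction, in Python and with a hypothetical function name. It assumes the string passed in is already the content of the main ContentPlaceHolder, and it strips nested tags and decodes HTML entities, which the post does not mention explicitly.

```python
import html
import re


def guess_meta_title(content_html: str) -> str:
    """Return the cleaned text of the first <h1>...</h1>, or "" if there is none."""
    match = re.search(r"<h1\b[^>]*>(.*?)</h1\s*>", content_html,
                      re.IGNORECASE | re.DOTALL)
    if match is None:
        return ""
    inner = re.sub(r"<[^>]+>", " ", match.group(1))  # drop nested tags
    return " ".join(html.unescape(inner).split())    # decode entities, tidy spaces
```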

Calculating meta description.

Most of the time, a description of what a text is about can be found (or at least it should be) within its first paragraphs. Based on that assumption, we parse and concatenate the text within paragraphs (<p></p> tags) after the first <h1> tag. In our experience, when the meta description tag is longer than 394 characters, Google Webmaster Tools complains about it being too long. With that in mind, we concatenate HTML-cleaned text from the first paragraphs to create the meta description tag, ensuring it is not longer than 394 characters. Once we know how our meta descriptions are created automatically, all we need to do is start our pages with an <h1> header tag followed by one or more paragraphs (<p></p>), which will be the source of the meta description for the page. This will be suitable for most scenarios. In other cases, you should change the way you create your pages or update the code to match your needs.
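
The concatenation can be sketched like this, again in Python with hypothetical function names. Cutting at a word boundary when the limit is reached is an assumption of this sketch; the post only states that the result must not exceed 394 characters.

```python
import html
import re

MAX_DESCRIPTION = 394  # upper bound quoted in the post


def clean_html(fragment):
    """Strip tags, decode entities and collapse whitespace."""
    return " ".join(html.unescape(re.sub(r"<[^>]+>", " ", fragment)).split())


def truncate_at_word(text, max_length):
    """Shorten text to at most max_length characters without splitting a word."""
    if len(text) <= max_length:
        return text
    head = text[:max_length + 1]
    cut = head.rfind(" ")
    return head[:cut] if cut > 0 else text[:max_length]


def guess_meta_description(content_html, max_length=MAX_DESCRIPTION):
    """Concatenate the paragraphs that follow the first <h1>, up to max_length."""
    heading = re.search(r"</h1\s*>", content_html, re.IGNORECASE)
    body = content_html[heading.end():] if heading else content_html
    description = ""
    for paragraph in re.findall(r"<p\b[^>]*>(.*?)</p\s*>", body,
                                re.IGNORECASE | re.DOTALL):
        text = clean_html(paragraph)
        if not text:
            continue
        joined = (description + " " + text).strip()
        if len(joined) > max_length:
            return truncate_at_word(joined, max_length)
        description = joined
    return description
```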

Calculating meta keywords.

Given the noise-word lists for a language, counting the occurrences of keywords (single words) and key phrases (two words) within the text is straightforward. We just iterate through the text, check each word against the noise words, and add a new keyword or increment its frequency if the keyword is already in the HashTable. At the end of the iteration, we sort the HashTables by descending frequency (using a custom class implementing System.Collections.IComparer). The final keywords list is a combination of the most frequent single keywords (ht1), up to 190 characters, and the most frequent two-word key phrases (ht2), up to a maximum of 250 characters in total. All of them are comma-separated values in lowercase.
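
Assembling the final string from the two frequency tables could look like the following Python sketch. The function names, the ", " separator, the alphabetical tie-break and the decision to stop at the first term that no longer fits are assumptions of the sketch; only the 190 and 250 character budgets come from the post.

```python
from collections import Counter

SINGLE_BUDGET = 190  # characters available to single keywords
TOTAL_BUDGET = 250   # characters available to the whole list


def by_frequency(counts):
    """Most frequent first; ties are broken alphabetically for stable output."""
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ordered]


def fill(terms, chosen, budget):
    """Append terms while the comma-separated result stays within budget."""
    result = list(chosen)
    for term in terms:
        if len(", ".join(result + [term])) > budget:
            break
        result.append(term)
    return result


def build_meta_keywords(singles, pairs,
                        single_budget=SINGLE_BUDGET, total_budget=TOTAL_BUDGET):
    """Combine single keywords and two-word phrases into one lowercase string."""
    chosen = fill(by_frequency(singles), [], single_budget)
    chosen = fill(by_frequency(pairs), chosen, total_budget)
    return ", ".join(chosen).lower()
```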

Summary.

Having meta tags correctly set is a must. However, it is sometimes difficult to set them manually on every page, let alone remember all possible keyword combinations. All too often only a few words are added, and this is where automatic keyword handling can help. If you think this might be your case, please download our sample VB project and give it a try (and a few debug traces too). I will be waiting for your comments.

Links.

Internet Information Services IIS optimization

2009/08/21

Fixing “Padding is invalid and cannot be removed” when requesting WebResource.axd

If you are using ASP.NET in your website and have a look at your Application EventLog you will probably see warning entries like this:

CryptographicException: Padding is invalid and cannot be removed.

Event Type: Warning
Event Source: ASP.NET 2.0.50727.0
Event Category: Web Event
Event ID: 1309
Date:  21/08/2009
Time:  13:08:48
User:  N/A
Computer: WEBSERVER
Description:
  Event code: 3005
  Event message: An unhandled exception has occurred.
  Event time: 21/08/2009 13:08:48
  Event time (UTC): 21/08/2009 11:08:48
  Event ID: 1cc59501bae34562a1e486c16f2e799f
  Event sequence: 11912
  Event occurrence: 1
  Event detail code: 0
  Application information:
    Application domain: /LM/W3SVC/1/ROOT-1-128952696565995867
    Trust level: Full
    Application Virtual Path: /
    Application Path: C:\Inetpub\webs\www.test-domain.com\
    Machine name: WEBSERVER
  Process information:
    Process ID: 3920
    Process name: w3wp.exe
    Account name: TEST-DOMAIN\IWAM_WEBSERVER
  Exception information:
    Exception type: CryptographicException
    Exception message: Padding is invalid and cannot be removed.
  Request information:
    Request URL: http://www.test-domain.com/WebResource.axd?d=pFeBotgPWN6u7M4UfAnWTw2&t=633687432177195930
    Request path: /WebResource.axd
    User host address: 127.0.0.1
    User:
     Is authenticated: False
    Authentication Type:
     Thread account name: TEST-DOMAIN\IWAM_WEBSERVER
  Thread information:
    Thread ID: 12
    Thread account name: TEST-DOMAIN\IWAM_WEBSERVER
    Is impersonating: False
    Stack trace:
       at System.Security.Cryptography.RijndaelManagedTransform.DecryptData(Byte[] inputBuffer, Int32 inputOffset, Int32 inputCount, Byte[]& outputBuffer, Int32 outputOffset, PaddingMode paddingMode, Boolean fLast)
       at System.Security.Cryptography.RijndaelManagedTransform.TransformFinalBlock(Byte[] inputBuffer, Int32 inputOffset, Int32 inputCount)
       at System.Security.Cryptography.CryptoStream.FlushFinalBlock()
       at System.Web.Configuration.MachineKeySection.EncryptOrDecryptData(Boolean fEncrypt, Byte[] buf, Byte[] modifier, Int32 start, Int32 length, IVType ivType, Boolean useValidationSymAlgo)
       at System.Web.UI.Page.DecryptStringWithIV(String s, IVType ivType)
       at System.Web.Handlers.AssemblyResourceLoader.System.Web.IHttpHandler.ProcessRequest(HttpContext context)
       at System.Web.HttpApplication.CallHandlerExecutionStep.System.Web.HttpApplication.IExecutionStep.Execute()
       at System.Web.HttpApplication.ExecuteStep(IExecutionStep step, Boolean& completedSynchronously)
Custom event details:
For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.

Depending on how busy your web server is, you may see them appear from time to time or as often as every few minutes, filling your EventLog and ranging from a slight annoyance to a real problem (depending on how hypochondriac you are).

In fact, they are just warnings that can be ignored in most cases, but they become a real problem when they bury other events and you cannot see the forest for the trees. If there are many of them and you want to get rid of them (or at least most of them), keep on reading.

You can check your IIS log around the times when the warnings appear and (if you also log the user-agent) you will probably see that most of the time the URL is NOT requested by a real user, but by a search engine spider doing its crawl (googlebot, msnbot, yahoo, teoma, or any other). You can double-check with a reverse DNS lookup of the offending IP address, running ping -a aaa.bbb.ccc.ddd, and you will see that the IP resolves to something like *.googlebot.com, *.search.msn.com, *.crawl.yahoo.net or *.ask.com. This should give you a hint on what to do…
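
If you have many IP addresses to check, the lookup can be scripted. The following Python sketch uses the standard socket.gethostbyaddr call for the reverse lookup; the function names are hypothetical and the suffix list simply repeats the domains mentioned above, so extend it as needed.

```python
import socket

# Suffixes taken from the examples above; this is not an exhaustive list.
CRAWLER_SUFFIXES = (".googlebot.com", ".search.msn.com",
                    ".crawl.yahoo.net", ".ask.com")


def is_known_crawler_host(hostname):
    """True when the host name ends with one of the known crawler domains."""
    return hostname.rstrip(".").lower().endswith(CRAWLER_SUFFIXES)


def reverse_lookup(ip_address):
    """Reverse DNS lookup, similar to ping -a; None if it cannot be resolved."""
    try:
        return socket.gethostbyaddr(ip_address)[0]
    except OSError:
        return None
```

A reverse lookup alone can be faked by whoever controls the reverse DNS zone of the address, so for anything more serious than log analysis it is prudent to confirm that the returned name resolves back to the same IP.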

WebResource.axd is just an httpHandler that serves resources embedded in a DLL. It returns anything from the small .gif files used as arrows by the ASP:Menu control to the .js files governing the behavior of the menu itself. Even if your website does not use the ASP:Menu control, you are probably using WebResource.axd for the JavaScript that handles the postback of your forms, or for something else.

Why does this exception happen?

If you look closely at the parameters of the WebResource.axd request you will notice two of them. The first one, d, identifies the resource being requested (the assembly and the name of the resource embedded in it). It is encrypted with the machineKey of the web site, which is why the stack trace above goes through DecryptStringWithIV. The second one, t, is a timestamp of the assembly that contains the resource, and it only changes when that assembly is updated or recompiled. If web.config does not explicitly declare a fixed machineKey, the key is generated automatically and the encrypted value of d may change from time to time (restarts, application pool recycles, etc.). A URL that was built with an older key can then no longer be decrypted.
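
To make the two parameters more concrete, the following Python sketch splits a WebResource.axd URL such as the one in the event above. The function name is hypothetical, and reading t as a .NET ticks value (100-nanosecond intervals counted from 0001-01-01) is an assumption based on the magnitude of the sample value.

```python
from datetime import datetime, timedelta
from urllib.parse import parse_qs, urlsplit


def describe_webresource_url(url):
    """Split a WebResource.axd URL into its path, d value and decoded t value."""
    parts = urlsplit(url)
    query = parse_qs(parts.query)
    ticks = int(query["t"][0])
    # Assumption: t holds .NET ticks, i.e. tenths of a microsecond since 0001-01-01.
    stamp = datetime(1, 1, 1) + timedelta(microseconds=ticks // 10)
    return {"path": parts.path, "d": query["d"][0], "t": stamp}
```

Decoding t for a few of the offending requests in your IIS log is a quick way to see how old the pages were that produced those URLs.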

In fact these CryptographicException warnings are well known in web farm configurations. All the servers belonging to the same farm must share the same machineKey: when a page (the .aspx container page) is served by one server of the farm, its resource URLs are generated with that server's key, and if the subsequent request for one of those URLs is handled by another server with a different key, the exception is raised and the user cannot download the resource. In that case we would be talking about real browsers with real users behind them, not search engine spiders.

Furthermore, if you have implemented conditional GET in your web server, this exception is more likely to happen: a user can come back to your site, request a page that has not changed, receive a 304 Not Modified, and still request the resources referenced by the cached copy of that page, whose URLs might no longer be valid because the key has changed in the meantime.

The solution: two steps.

As you can imagine, the first thing you can do is set a fixed machineKey in your web.config file. Even if you are not running a cluster or a web farm, it will help you minimize the occurrences of the warning Padding is invalid and cannot be removed.

For this you can use a machineKey generator, or generate your own keys if you know how to do it (arbitrary characters will not work; the keys must be hexadecimal strings of a valid length).

<system.web>
  <machineKey
    validationKey='A06BDCF2F6CF.A.VERY.LONG.44F13E76184945A7C477601'
    decryptionKey='99079B21C2F3644.A.BIT.SHORTER.BB81C7E9D58378'
    validation='SHA1'/>
</system.web>
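
If you prefer to generate the keys yourself, a small script is enough. The following Python sketch draws random bytes from a cryptographically secure source and prints them as hexadecimal. The function name is hypothetical, and the sizes (64 bytes for validationKey, 32 bytes for decryptionKey) are assumptions that match commonly used values; check the machineKey documentation for your framework version before relying on them.

```python
import secrets


def generate_machine_key(validation_bytes=64, decryption_bytes=32):
    """Return a machineKey element with freshly generated hexadecimal keys."""
    # token_hex returns two hexadecimal characters per random byte.
    validation_key = secrets.token_hex(validation_bytes).upper()
    decryption_key = secrets.token_hex(decryption_bytes).upper()
    return (
        "<machineKey\n"
        f"  validationKey='{validation_key}'\n"
        f"  decryptionKey='{decryption_key}'\n"
        "  validation='SHA1'/>"
    )


if __name__ == "__main__":
    print(generate_machine_key())
```

Generate the keys once, paste them into web.config and keep them secret; generating new keys on every deployment would bring back the very problem you are trying to avoid.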

The second (and easier) step is to prevent WebResource.axd URLs from being requested as much as possible, in particular by search engine crawlers or bots, since those resources are not real content and should not be indexed or cached by them in any way. If you just add the following lines to your robots.txt you will see the frequency of the CryptographicException drop drastically. If you also change the machineKey to a static value, you will get rid of them almost completely.

User-agent: *
Disallow: /WebResource.axd
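
To check that the rule does what you expect before deploying it, you can feed it to a robots.txt parser. This Python sketch uses the standard urllib.robotparser module; the function name and the default user-agent are just illustrative choices.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /WebResource.axd
"""


def crawler_may_fetch(url, user_agent="Googlebot", robots_txt=ROBOTS_TXT):
    """True when a well-behaved crawler is allowed to request the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

Since Disallow rules match by path prefix, the single line covers every WebResource.axd URL regardless of its d and t parameters, while normal pages remain crawlable.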

As I said, you will get rid of this warning almost completely. There may be search engines that do not follow your robots.txt policies, users visiting you from a page cached by Google, etc., so you cannot get rid of these warning messages for good, but you can reduce them enough for them not to be a problem anymore.

Summary.

Summing up, this event appears when there is a long time gap between the moment a page is served and the moment the resources it references are requested. During that gap, the application pool might have been recycled, the application recompiled, the server restarted, etc., so the resource URLs embedded in the older copy of the page can no longer be decrypted (the cryptographic checks fail).

Links.

Internet Information Services IIS optimization

Keywords.

WebResource.axd, CryptographicException, padding, invalid, removed, machineKey, exception, warning, IIS