On Use of the Lang Attribute

HTML5 Logo with character for Chinese number 5.

Way back in October I noticed this WHATWG HTML bug (26942) where someone asked, “Why do these examples of <html> lack the lang attribute?” I thought the answer from Hixie was a bit dismissive and not based on any data or real-world benefits of use, particularly in the context of screen readers:

Why not? Realistically, few people include it. It just means the language is unknown.

At the time, I could not get the latest archive to download from WebDevData.org (though that has changed, see below), so I fell back to asking for help on why the lang attribute is valuable.

How the lang Attribute on <html> Is Used

I got lots of good bits of feedback, which I collected into a Storify (crappy PDF version since the new Storify domain owners may block WayBack). I’ve distilled all that great information to these key points:

In the absence of setting a lang attribute on the <html> element, screen readers will fall back to the user’s default system setting (barring any custom overrides) when speaking content.

How Many Pages Use lang

On January 8, WebDevData.org (from a W3C Community Group) posted its latest archive (which did not error on download, woo!). It consists of the HTML from 87,000 web pages.

I pulled down the 780MB file and re-taught myself the skills necessary to parse the files. For those who are regular expression geniuses, you are welcome to suggest an alternate approach, but I used the following pattern to return all the <html> elements: <html([^>]+)>. It fails for any <html> with no attributes at all, but for what I am doing that’s ok.
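For anyone who wants to try something similar, here is a minimal sketch in Python of how that pattern could be applied to one page's source. The post does not say which tool was used, so the function name html_lang, the second pattern LANG_ATTR, and the case-insensitive flag are all assumptions made for illustration.

```python
import re

# The pattern from the post: captures the attribute string of an <html> start tag.
# It requires at least one character after "<html", so a bare <html> never matches.
# IGNORECASE is an assumption added here so that <HTML LANG="en"> is also caught.
HTML_TAG = re.compile(r"<html([^>]+)>", re.IGNORECASE)

# Hypothetical helper pattern (not from the post): a lang attribute with a quoted
# or unquoted value. The lookbehind skips xml:lang, data-lang, and hreflang.
LANG_ATTR = re.compile(r"""(?<![\w:-])lang\s*=\s*["']?([^\s"'>]+)""", re.IGNORECASE)


def html_lang(page_source):
    """Return the lang value on the first <html> tag, or None if there is none."""
    tag = HTML_TAG.search(page_source)
    if tag is None:
        return None  # no <html> tag, or an <html> tag with no attributes
    attr = LANG_ATTR.search(tag.group(1))
    return attr.group(1) if attr else None


print(html_lang('<html class="no-js" lang="en">'))
print(html_lang("<html>"))
```

A regular expression is a blunt instrument for HTML, so a sketch like this will miscount unusual markup (for example, an attribute value that itself contains the text lang=). For a rough survey that may be acceptable.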

Of the 84,054 pages I parsed (I excluded XML, ISO files, and so on), I found that 39,433 use the lang attribute on the <html> element. That’s just about 47% (46.914% if I understand significant digits correctly).
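The percentage can be checked with a couple of lines of Python. The two counts come from the paragraph above; the function name share_percent is just an illustrative choice.

```python
def share_percent(count, total, digits=3):
    """Return count as a percentage of total, rounded to the given decimal places."""
    return round(100 * count / total, digits)


# Pages with lang on <html>, out of all pages parsed (figures from the post).
print(share_percent(39_433, 84_054))  # 46.914
```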

What that tells me is that, far from few people including it, nearly half of the pages in this sample include it.

There are 12,672 instances of xml:lang, though at a quick scan they appear alongside lang. If anyone with better regex skills would like to help me further parse, please let me know.
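One way to go beyond a quick scan would be to sort each <html> tag into a bucket based on which of the two attributes it carries. The sketch below is an assumption about how that could be done, not what was actually run for this post; the names classify and tally and both attribute patterns are hypothetical.

```python
import re
from collections import Counter

# The pattern from the post, plus an assumed case-insensitive flag.
HTML_TAG = re.compile(r"<html([^>]+)>", re.IGNORECASE)

# Hypothetical attribute patterns. The lookbehind keeps the plain lang pattern
# from matching inside xml:lang, data-lang, or hreflang.
PLAIN_LANG = re.compile(r"(?<![\w:-])lang\s*=", re.IGNORECASE)
XML_LANG = re.compile(r"(?<![\w:-])xml:lang\s*=", re.IGNORECASE)


def classify(page_source):
    """Label a page by which language attributes appear on its <html> tag."""
    tag = HTML_TAG.search(page_source)
    attrs = tag.group(1) if tag else ""
    has_lang = bool(PLAIN_LANG.search(attrs))
    has_xml_lang = bool(XML_LANG.search(attrs))
    if has_lang and has_xml_lang:
        return "both"
    if has_lang:
        return "lang only"
    if has_xml_lang:
        return "xml:lang only"
    return "neither"


def tally(pages):
    """Count how many pages fall into each bucket."""
    return Counter(classify(page) for page in pages)


sample = ['<html lang="en" xml:lang="en">', '<html xml:lang="de">', "<html>"]
print(tally(sample))
```

The "xml:lang only" bucket is the interesting one, since those pages declare a language but would not be counted in the lang figure above.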

Why You Should Use the lang Attribute on the <html> Element

Hyphens

By using lang, you get the benefit of automatic hyphenation in your (modern) browser that you otherwise would not get (assuming you use hyphens: auto in your CSS).

Accessibility

At the very least, lang is a benefit for screen reader users, particularly when your users don’t have the same primary language as your site. It allows proper pronunciation and inflection when the page is spoken.

WCAG Compliance

Including the lang attribute is a Level A requirement of the Web Content Accessibility Guidelines 2.0 (specifically item 3.1.1 Language of Page). Technique H57 identifies the lang attribute specifically.

Internationalization

The W3C Internationalization (I18n) Activity has a great Q&A on why you should use lang, which was updated less than two months ago. I’ll reprint the start of the answer, but there is far more detail and I strongly recommend you go read it.

Identifying the language of your content allows you to automatically do a number of things, from changing the look and behavior of a page, to extracting information, to changing the way that an application works. Some language applications work at the level of the document as a whole, some work on appropriately labeled document fragments.

We list here a few of the ways that language information is useful at the moment, however, as specifications and browsers evolve in the future there could be numerous additional applications for language information.

Interesting Aside

If you go to the WHATWG HTML5 specification today and view the page source, you’ll see the following language declaration in the code:

<html class=split data-revision="$Revision: 8877 $" lang=en-GB-x-hixie>

Not to be outdone, the W3C HTML5 spec has the same language declaration.

If anybody has the en-GB-x-hixie phonologic dictionary in his or her screen reader, I’d love to hear it.

While technically allowed (the -x puts it in the private use sub-tag category), it’s bad form:

Private-use subtags do not appear in the subtag registry, and are chosen and maintained by private agreement amongst parties.

Because these subtags are only meaningful within private agreements and cannot be used interoperably across the Web, they should be used with great care, and avoided whenever possible.

Update: January 1, 2015

For what it’s worth, I’ve filed bugs against the W3C HTML5 spec and the WHATWG HTML5 spec.

Update: February 25, 2015

Another case where a lang attribute is important, though in this case on a specific element, is outlined in the piece HTML5 number inputs – Comma and period as decimal marks:

<input type="number"> will open a numeric software keyboard on modern mobile operating systems. Not every user can input decimal numbers into this convenient field without proper localization.

[…]

Half the world uses a comma and the other half uses a period as their decimal mark. (In Latin scripts.) Does your web application take that into consideration? Do the browsers?

Update: April 18, 2016

The WHATWG spec documents have finally been updated (hopefully all of them) to include the lang attribute.

If you look at the changes on GitHub, you’ll see that the lang attribute has been added to code examples as well.

Update: April 19, 2016

The W3C spec is also being updated with notes on how and why to use lang.

If you have more you’d like to add to that note, you can do so over on GitHub.

Update: May 20, 2016

Steve Faulkner made a related demo.

The same English language text “What will you do when the label comes off, And the plastic’s all melted, And the chrome is too soft?” is marked up in paragraphs with es/fr/de lang attribute values. These each affect the way the text is pronounced.

test file used: http://s.codepen.io/stevef/debug/MyapoQ

Update: July 14, 2016

The W3C HTML5 validator will now try to detect the language of the page you are validating and compare it to the lang attribute as well as the dir attribute. If these do not appear to match, then you will get a warning.

Update: April 16, 2018

Much of this post and a lot of new material made it into my London Web Standards talk Mind Your lang.

Update: December 5, 2021

I should have been more explicit in here that you should avoid region sub-tags (en versus en-us) unless users have a genuine need for a specific dialect.

From a best practice perspective (emphasis theirs):

You always start by choosing a primary language subtag, and often this is all you’ll need for your language tag.

Always bear in mind that the golden rule is to keep your language tag as short as possible. Only add further subtags to your language tag if they are needed to distinguish the language from something else in the context where your content is used.

As for screen reader support, region sub-tags are generally ignored (emphasis theirs):

If a screen reader supports the regional differences—such as having language voices installed for both Peninsular and Mexican Spanish—then the screen reader may switch to the appropriate dialect.

However, region subtags are typically ignored, especially if the screen reader’s default language matches the primary language specified. This is because it is presumed the user will prefer and better understand their default dialect over a different dialect of that same language. The numerous screen reader users in Great Britain, for example, will typically hear Great Britain English on U.S. web sites, even if the page has lang="en-US" specified and the user has the US English language voice installed.

Only use region subtags when it is necessary to differentiate content in different dialects that may not be mutually intelligible. A web site that provides content in both Mandarin and Cantonese (one of which may not be understood by speakers of the other) would typically differentiate them using lang="zh-CN" and lang="zh-HK" respectively. […]

Script and extended language sub-tags have poor support in screen readers.
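If you want to audit the lang values on your own pages against that advice, a rough sketch in Python might look like the following. It only splits a tag on hyphens, does not validate anything against the subtag registry, and does not handle tags that start with a private-use x. The function names split_language_tag and has_extra_subtags are hypothetical.

```python
def split_language_tag(tag):
    """Split a language tag into its primary subtag and any further subtags."""
    subtags = tag.strip().split("-")
    primary = subtags[0].lower()
    return primary, subtags[1:]


def has_extra_subtags(tag):
    """True if the tag carries more than a primary language subtag."""
    _, extra = split_language_tag(tag)
    return len(extra) > 0


for value in ["en", "en-US", "zh-HK", "en-GB-x-hixie"]:
    primary, extra = split_language_tag(value)
    note = "check whether these are needed" if extra else "primary subtag only"
    print(value, primary, extra, note)
```

A flagged tag is not necessarily wrong. As the quoted guidance says, something like zh-HK can be the right choice when the distinction matters to users, so treat the output as a list of values to review by hand, not to strip automatically.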

9 Comments

Reply

Better regex to select HTML tags

In response to Jake Wilson. Reply

Jake, was there a regex example in your comment? I am not seeing one.

Reply

I had a screenreader user complaining that my website made his screenreader read out oddly: turned out the en-gb tag was switching his TTS. Confused him utterly! Most users won't have multiple TTS variants, I suspect, but might be best to stick with "en" rather than "en-gb" and "en-us" so as not to surprise users?

In response to alasdairking. Reply

Was the content in British English? For example, did the content include spellings like "colour" instead of "color?" Punctuation inside or outside of quotes?

I have no idea if those would cause a difference in experience, but that's certainly where I'd start looking. I'll also ping some folks and see if others are familiar with this.

Reply

Better <html> regex: <html\b.*?>

\b is a break (a word boundary)
.* is anything. ? afterwards makes it "not greedy"

In response to Stephen Kamenar. Reply

Thanks, I will give that a shot and see what I get.

Reply

Document language is typically easy to set and maintain, but doing so simply by HTTP header can be more practical. Changes in language are more costly to set and maintain, and detection of such changes may be best delegated to software (I don’t think this is covered here, however). Our whole way of working with language may not be ideal [1].

[1] http://meiert.com/en/blog/20140825/html-and-language/

Reply

I was able to scare up some additional reasons to use lang:

  • Default font selection for CJK languages (very politically important, there’s still hurt feelings over how Unicode handled them)
  • The quotation marks around <q> change with the language
  • CSS’s ::first-letter pseudo-element can behave badly in non-English languages without lang. Other upcoming CSS typographical niceties like hanging punctuation are also likely to require it.
  • The spellcheck attribute needs it to function properly, especially for multilingual users, as browser heuristics aren’t perfect
  • Input types for dates and times also need it, because around the world there is absolutely no de facto standard for writing them. This huge variance is partly why browsers have been so slow to implement these types.
In response to Taylor Hunt. Reply

Taylor, your comment came to mind when I read this post and I figured it might be useful for those who don’t know the CJK font issue: Localization Gotchas for Asian Languages (CJK) (I did localization in that part of the world over a decade ago, and I wasn’t clear on it).
