For a website to be found by internet users, its pages must be listed in search engines, indexes, or directories. Search engines (such as Google) are automated: robots and spiders scan the internet looking for new and modified webpages, harvest links and other material, and pass on looking for further content. Directories (such as the Open Directory Project and Yahoo) have traditionally relied on human editors as well. The internet is a marvelously interwoven web. Websites and their component pages are listed based in part upon keywords, titles, and other content found within the headers. Although most search engines also scan the body text, the headers supply the title and description under which a page is listed. Thus, properly formatted headers are essential. On this page, we will show you how to develop your pages to help ensure accurate placement in the major search engines and indexes.
As you have already learned elsewhere, an HTML page is structured in two sections: the header and the body. The header carries the title, metadata, and other essential information for the web browser and for search engines. Apart from the title, content in the header is not displayed to the user. The content of the body is generally visible to the user. It is the header content that concerns us here. For more about the body content, see the HTML Code page.
Consider a sample webpage. The headers of the docu.htm page are as follows:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<title>Documents of Ecumenical Interest</title>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
<meta name="description" content="Documents of Ecumenical Interest" />
<meta name="keywords" content="documents, ecumenical, ecumenism, interchurch, inter-church, interfaith, inter-faith, inter-religious, inter-denominational, church, Christian, unity, denomination, denominational" />
<link rev="parent" href="http://www.ecumenism.net/" />
<link rel="stylesheet" type="text/css" href="txt/pce.css" title="css" />
</head>
<body>
...
The header is technically only the content between the <head> ... </head> tags. Note, however, that there are two tags prior to the first <head> tag: <!DOCTYPE ... > and <html>. The <!DOCTYPE ... > declaration tells the browser what kind of document it is. In this case, it tells the browser that the page is Hypertext Markup Language (HTML) 4.01 Transitional. (The "EN" refers to the language of the document type definition, not to the language of your content.) This tag is sometimes generated by your HTML editor. If not, don't worry: most browsers will still display the page without it, although some may render it in a less standards-compliant mode. If you want, you can copy this tag. It is unlikely that you will be using a more recent version of HTML.
The <title> tags enclose a descriptive title of the page. Please ensure that every page has a different title. Do not attempt to publish pages without titles. Titles are displayed in the top bar of the browser window.
A trick:
It is possible to change the title using a JavaScript function after the page is loaded. Our website carries both English and French titles. Based on the user's language preferences, the French title may be substituted for the English. To do this, place the following after the <body> tag:
<script>function changeTitle() { document.title = "New title"; }</script>
In the text, place the code:
<a href="javascript:changeTitle();">Change the title</a>
You can try it yourself: Change the title of this page. You can play with this in a variety of ways.
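The language-based substitution described above might be sketched as follows. This is only an illustration, not the script our website actually uses: the `titles` object, the `pickTitle` function, and the French wording are hypothetical names and values chosen for the example.

```javascript
// Hypothetical titles; the French wording is an illustration only.
const titles = {
  en: "Documents of Ecumenical Interest",
  fr: "Documents d'intérêt œcuménique",
};

// Return the title for the first preferred language we have, else the fallback.
function pickTitle(available, preferredLanguages, fallback = "en") {
  for (const tag of preferredLanguages) {
    const primary = tag.toLowerCase().split("-")[0]; // "fr-CA" -> "fr"
    if (Object.prototype.hasOwnProperty.call(available, primary)) {
      return available[primary];
    }
  }
  return available[fallback];
}

// In a browser, apply the choice; outside a browser this step is skipped.
if (typeof document !== "undefined" && typeof navigator !== "undefined") {
  const preferred = navigator.languages || [navigator.language];
  document.title = pickTitle(titles, preferred);
}
```

Keeping the choice in a separate function means it can be tried with any list of language tags, for example `pickTitle(titles, ["fr-CA", "en"])`, without loading the page in a browser.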
The <meta http-equiv="Content-Type" ... > tag is used to tell the browser which character set to use. This allows websites to use non-Latin character sets such as Greek, Cyrillic, Hebrew, Chinese, etc. English character sets are generally either ISO-8859-1 or Windows-1252. Although either will allow use of all normal English characters and punctuation, Windows-1252 adds some characters, such as curly quotation marks and long dashes, that ISO-8859-1 lacks and that may be rendered differently on Macintosh and other non-Windows operating systems. For a standard look and feel, specify the ISO character set. As long as your text avoids those extra characters, you will not have to make any adjustments to your content, merely change this tag.
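Before declaring the ISO character set, it may help to check whether your text contains anything outside it. The following is a minimal sketch under that assumption; the function name `charactersOutsideLatin1` and the sample string are hypothetical. It relies on the fact that ISO-8859-1 covers the code points U+0000 through U+00FF and nothing else.

```javascript
// List the distinct characters in `text` that ISO-8859-1 cannot represent.
// ISO-8859-1 covers exactly the code points U+0000 through U+00FF.
function charactersOutsideLatin1(text) {
  const found = new Set();
  for (const ch of text) {
    if (ch.codePointAt(0) > 0xff) {
      found.add(ch);
    }
  }
  return [...found];
}

// Sample text with curly quotes and an en dash (word-processor habits),
// plus an accented "e", which ISO-8859-1 does include.
const sample = "\u201Cunity\u201D \u2013 caf\u00E9";
console.log(charactersOutsideLatin1(sample));
```

An empty result suggests the text can be served as ISO-8859-1 unchanged; anything listed would need to be replaced with a plain equivalent or written as an HTML character reference.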
The <meta name="description" ... > tag is used to describe the page in more detail than is possible in a document title. Our abbreviated description should be expanded upon.
The <meta name="keywords" ... > tag is one of the most important tags for search engines that read it. With this tag you can greatly expand the searchability of your page. These keywords will be indexed and searchable in any search engine that honours the tag when it scans our site; be aware that some engines ignore it. Keywords should be entered in priority sequence from most relevant to least, should be separated by commas, and should never be duplicated. Be generous in the number of keywords entered, but never enter inaccurate terms.
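The rules above (comma-separated, in priority order, no duplicates) can be applied mechanically. Here is a small sketch, with hypothetical function names `normalizeKeywords` and `keywordsMetaTag`; it assumes the keywords contain no double quotes or angle brackets.

```javascript
// Trim each keyword, drop empties and case-insensitive duplicates,
// and keep the original (priority) order.
function normalizeKeywords(keywords) {
  const seen = new Set();
  const result = [];
  for (const raw of keywords) {
    const keyword = raw.trim();
    const key = keyword.toLowerCase();
    if (keyword !== "" && !seen.has(key)) {
      seen.add(key);
      result.push(keyword);
    }
  }
  return result;
}

// Build the tag; assumes keywords contain no double quotes or angle brackets.
function keywordsMetaTag(keywords) {
  return `<meta name="keywords" content="${normalizeKeywords(keywords).join(", ")}" />`;
}

console.log(keywordsMetaTag(["ecumenism", " church", "Church", "", "unity"]));
```

The first spelling of a duplicated keyword is the one kept, so the most relevant terms should still be listed first.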
Some keywords are normally associated with pornographic sites. Use of these terms greatly increases the traffic to the site, but will result in the entire domain being blocked by "content advisors" such as Net Nanny. Please do not do this.
The <link rev="parent" ... > tag specifies the parent file of this page. Other files in the /docu directory specify docu.htm as their parent. This assists search engines in identifying the structure of a website. It allows automated systems to facilitate navigation through your pages. You should point all pages to your index.htm (home page) or to another of your pages. DO NOT specify our pages as your parent if you want to be listed independently in Google.
The <link rel="stylesheet" ... > tag is used to specify the cascading style sheet (CSS) used by the page. CSS allows for a standardised use of fonts and other layout specifications. It also permits easy changes to the entire website. Our website uses this stylesheet, but we ask that our IPs develop their own stylesheets. Please try to develop your own "look".
Okay, that is all you need for the basic headers of a webpage. There are numerous additional tags that you could use. Many tags are specific to certain search engines, and will be ignored by all others. If you develop a site that you want to submit to the search engines, check the other websites in your category to see how they developed their headers. It will give you an idea of what succeeds.
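To get a rough idea of what a robot might take from the basic headers, the following sketch pulls the title and the named <meta> tags out of a page. It is an illustration only: the function name `extractHeadMetadata` is hypothetical, and the regular expressions assume double-quoted attributes with name written before content, as in our sample header. A real robot would use a proper HTML parser.

```javascript
// Pull the title and named <meta> tags out of a page's header.
// A regular-expression sketch only, not a full HTML parser.
function extractHeadMetadata(html) {
  const headMatch = /<head(?:\s[^>]*)?>([\s\S]*?)<\/head>/i.exec(html);
  const head = headMatch ? headMatch[1] : "";

  const titleMatch = /<title[^>]*>([\s\S]*?)<\/title>/i.exec(head);
  const result = { title: titleMatch ? titleMatch[1].trim() : null, meta: {} };

  // Assumes double-quoted attributes with name before content.
  const metaPattern = /<meta\s+name="([^"]*)"\s+content="([^"]*)"[^>]*>/gi;
  let match;
  while ((match = metaPattern.exec(head)) !== null) {
    result.meta[match[1].toLowerCase()] = match[2];
  }
  return result;
}

const page = [
  "<html><head>",
  "<title>Documents of Ecumenical Interest</title>",
  '<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />',
  '<meta name="description" content="Documents of Ecumenical Interest" />',
  '<meta name="keywords" content="documents, ecumenical, ecumenism" />',
  "</head><body>...</body></html>",
].join("\n");

console.log(extractHeadMetadata(page));
```

Running this against your own pages is a quick way to confirm that each one has a title, a description, and keywords before you submit it.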
Have you ever used a library catalogue? Notice how every item has a separate record with multiple fields for author, title, subject, publisher, etc. That is called "metadata": data about data. A book is itself merely a collection of data. The metadata is used by the library computers to assist your search process. Properly catalogued books are easy to find; improperly or inaccurately catalogued books may never be found again.
"Although the concept of metadata predates the Internet and the Web, worldwide interest in metadata standards and practices has exploded with the increase in electronic publishing and digital libraries, and the concomitant "information overload" resulting from vast quantities of undifferentiated digital data available online. Anyone who has attempted to find information online using one of today's popular Web search services has likely experienced the frustration of retrieving hundreds, if not thousands, of "hits" with limited ability to refine or make a more precise search. The wide scale adoption of descriptive standards and practices for electronic resources will improve retrieval of relevant resources in any venue where information retrieval is critical." (Dublin Core Metadata Initiative)
Well, now it is possible to specify metadata about your webpages. But, wait a minute, isn't that what we did with the headers? Right, now you're catching on! The headers are an HTML protocol for metadata. However, what if you want your site to be properly catalogued in a library computer? It happens, but only with Dublin Core Metadata. Using Dublin Core in your pages can allow specialised systems to harvest the data and provide access to users in much more accurate search engines.
At this point in time, our website has very little Dublin Core embedded in the pages. However, as we expand the permanent archive, we intend to add this metadata to all pages. IPs should consider placing Dublin Core in any documents submitted to the archive.
As a preliminary schema, all pages should use the following extra headers:
<link rel="schema.DC" href="http://purl.org/DC/elements/1.0/" />
<meta name="DC.Creator" content="Jesson, Nicholas" />
<meta name="DC.Title" content="Orthodox contributions to ecumenical ecclesiology" />
<meta name="DC.Subject" content="Orthodox Eastern churches; Christian unity; Theology" />
<meta name="DC.Description" content="A study of the Eastern Orthodox churches participation in the ecumenical movement, with particular focus upon their theological reflections on the nature of the church." />
<meta name="DC.Type" content="essay" />
<meta name="DC.Date" content="October 2001" />
<meta name="DC.Format" content="application/pdf" />
<meta name="DC.Language" content="en" />
<meta name="DC.Identifier" content="http://www.ecumenism.net/archive/jesson_orthodox.pdf" />
<meta name="DC.Rights" content="Copyright Nicholas Jesson 2001" />
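If you maintain many documents, these headers could be generated from a simple record rather than typed by hand. The following is a minimal sketch under that assumption; the function names `escapeAttribute` and `dublinCoreTags` are hypothetical, and the record shown reuses values from the sample above.

```javascript
// Escape characters that would break a double-quoted HTML attribute value.
function escapeAttribute(value) {
  return String(value)
    .replace(/&/g, "&amp;")
    .replace(/"/g, "&quot;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// Turn { Creator: "...", Title: "..." } into one DC.* meta tag per field,
// preceded by the schema link used in the sample headers.
function dublinCoreTags(record) {
  const lines = ['<link rel="schema.DC" href="http://purl.org/DC/elements/1.0/" />'];
  for (const [element, content] of Object.entries(record)) {
    lines.push(`<meta name="DC.${element}" content="${escapeAttribute(content)}" />`);
  }
  return lines.join("\n");
}

console.log(dublinCoreTags({
  Creator: "Jesson, Nicholas",
  Title: "Orthodox contributions to ecumenical ecclesiology",
  Language: "en",
}));
```

Escaping the values guards against a title or description that happens to contain a quotation mark, which would otherwise end the content attribute early.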
Further specifications can be found at http://www.ietf.org/rfc/rfc2731.txt. Two elements that we would like to see used wherever possible allow encoding of ISBN and Library of Congress Catalog Numbers. These will obviously only apply to a small number of items.
<meta name="DC.Identifier" scheme="ISBN" content="1-56592-149-6" />
<meta name="DC.Identifier" scheme="LCCN" content="67-26020" />
Note that each element name has a prefix of DC. Additional schemas are envisioned that may use alternative prefixes.
Dublin Core can be added to HTML files at any time. However, it is difficult to add DC to Adobe PDF files after creation. It is recommended that you specify the DC in your word processor and then convert the file using Adobe's Distiller.
Sometimes we create a webpage that is not intended to be searchable, for example a comment form or search box. As a consideration for others searching the web, we can instruct the search engines to ignore certain pages that are only used for local housekeeping. The easiest way to exclude a page from the robots and spiders is to enter a <meta> tag in the following format:
<meta name="robots" content="noindex, nofollow" />
Alternative specifications are all, index, and follow. Perhaps you want your page to be excluded from the search engines, but you have included links to other pages that you want the robots to consider. Use:
<meta name="robots" content="noindex, follow" />
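To make the meaning of these values concrete, here is a sketch of how a well-behaved robot might interpret the content of the robots tag. The function name `parseRobotsContent` is hypothetical, and the sketch handles only the values mentioned on this page; anything else is ignored.

```javascript
// Interpret a robots meta value such as "noindex, follow".
// Starts from indexing and following, which is what "all" requests.
function parseRobotsContent(content) {
  const policy = { index: true, follow: true };
  const directives = content.toLowerCase().split(",").map((d) => d.trim());
  for (const directive of directives) {
    if (directive === "noindex") policy.index = false;
    else if (directive === "nofollow") policy.follow = false;
    else if (directive === "index") policy.index = true;
    else if (directive === "follow") policy.follow = true;
    else if (directive === "all") {
      policy.index = true;
      policy.follow = true;
    }
    // Unrecognised directives are ignored in this sketch.
  }
  return policy;
}

console.log(parseRobotsContent("noindex, follow"));
```

With "noindex, follow" the page itself stays out of the listings while its links are still passed along to the robot.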
Be aware that some search engines will ignore these instructions. Email harvesting software used by spammers regularly ignores these tags. To hide email addresses from harvesters, try our nospam JavaScript.