Academic Website V: Telling Machines Who You Are

Hendrik Erz

Abstract: In today's article, I want to look closer at the SEO-aspect of a website, Search Engine Optimization. Here I describe how you can make yourself and your publications machine-readable, and enrich your website with metadata so that search engines and other applications can make more use of your website.

Published on Friday, December 22nd, 2023 by Hendrik | 19 min reading time

In the previous four articles, I guided you through the initial considerations and through actually setting up your personal website. At this point you hopefully have a personal website, have chosen a cool theme, and have added the minimum amount of information to it. Or you want to read the entire series first to make sure you have everything in view, in which case you probably don’t have one yet.

Today, I will be sharing how you can tell machines who you are. Because all you’ve done so far is tell people who you are: Your website is visible to humans, and it contains what you want to tell them. But you also need to think about people who don’t yet know you, and those will more often than not find you via some search engine.

It turns out that what will really make your website shine is SEO, or Search Engine Optimization. In this article, I will mostly just tell you how to make your website “machine-readable”, so to speak, and I will introduce several systems for doing so, some of which do pretty much the same thing but are all still somehow required (relevant XKCD): basic HTML metadata, JSON-LD, OpenGraph, Dublin Core, and Highwire Press tags. I’ll also run you through a few other SEO strategies, like using robots.txt and a sitemap, and using rel="nofollow" on links properly.

SEO contains a lot of arcane knowledge that is really simple to use, but hard to find online. Again, there are entire book series on how to properly do SEO, but in this article I want to focus on a few fundamentals that you should get up and running as soon as possible.

So let us begin.

The first start: `robots.txt` and `sitemap.xml`

The most basic part of a website is a robots.txt file. It’s a simple text file that lives directly at the root of your website and can tell crawlers which pages to index, and which ones not. Google has a dedicated guide on how to properly write a robots.txt file. You don’t actually need this if you don’t upload something to your website that the world shouldn’t see, because the default behavior of many crawlers is to just visit your website and crawl whatever they find.

You can use this, for example, to disallow crawling of images or PDF files. My own robots.txt is relatively empty, because I don’t upload anything I don’t want to become public knowledge to this website. I have other places where I can put stuff that Google should not index. But depending on your use cases, this might be a good place to start. Some websites may even come with a robots.txt already filled in for you. This is, for example, the case for WordPress.com pages. You should be able to find the file by typing https://your-domain.tld/robots.txt.

Here’s my robots.txt for reference.

User-agent: *
Allow: /
Host: https://hendrik-erz.de

User-agent: GPTBot
Disallow: /

Sitemap: https://www.hendrik-erz.de/sitemap.xml

As you can see, there are a few properties you can set; the most important of which are “User-agent”, and “Allow” or “Disallow”. A User-agent identifies a crawler; for example, there is Googlebot, which indexes your page for search, and in my case you can see GPTBot. I found this user-agent while researching the article, and what I basically do is prevent OpenAI from crawling my blog. Finding my articles? Absolutely. Using them without payment for commercial LLMs? Nope. There are many different crawlers out there, and depending on your preferences, you may ask them not to crawl your website; among them are FacebookBot and Bingbot. You can find what to put into the User-agent properties by googling around.

These files work like so: First, you define a User-agent. The asterisk (*) simply means “all” (but some crawlers will ignore that). Then, you define a series of “Allow”- and “Disallow”-URLs. In my case, I simply allow the crawling of all URLs, because there’s nothing private on my website (as mentioned, I have other places for that). There is also the Host property that allows you to specify your preferred domain; note that it is a non-standard extension, so only some crawlers read it and the others simply skip that line.
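To get a feeling for how a well-behaved crawler might read such a file, here is a minimal sketch that feeds my robots.txt to the parser in Python's standard library. The article URL is a made-up placeholder, and real crawlers implement their own matching logic, so take this as an illustration and not as a guarantee of what any specific bot does.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt from above, verbatim.
ROBOTS_TXT = """\
User-agent: *
Allow: /
Host: https://hendrik-erz.de

User-agent: GPTBot
Disallow: /

Sitemap: https://www.hendrik-erz.de/sitemap.xml
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content from a string instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Hypothetical URL, only used to ask the parser a question.
post_url = "https://www.hendrik-erz.de/post/some-article"

for agent in ("Googlebot", "GPTBot"):
    # can_fetch() picks the group that matches the agent,
    # falling back to the "*" group if there is none.
    print(agent, "may crawl:", parser.can_fetch(agent, post_url))

# The Sitemap line is exposed separately (Python 3.8+).
print(parser.site_maps())
```

With this parser, the dedicated GPTBot group wins over the catch-all group, and the unknown Host line is skipped without complaint.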

Finally, you will see the property “Sitemap”. This is the next piece of information I want to tell you about.

If you open your website and look around, you will see that your website has a set of different pages: A home page, a CV page, and if you want to write blog posts, a blog page. Now, for a human it is pretty easy to see which pages are where, and how important they are. There are pages on your website that are very important and will be updated often, but there are also those that are pretty unimportant. But again, a machine is dumb, so you’ll have to help it. A sitemap does exactly that: It is a (machine-readable) map of your site. Specifically, it will include all links that you want to have indexed with an assigned priority (how important is this page in your opinion?) and a change frequency (how often does one expect the page to change?). If you add a sitemap, you can simply leave out URLs you don’t want crawled. Google even recommends you provide a sitemap instead of meticulously allowing/disallowing various URLs.

Just like your robots.txt, your sitemap.xml lives at the root of your website. Most websites will generate one (semi-)automatically for you, or have some plugin that does this. So creating a sitemap.xml should be a piece of cake – refer to the documentation of whichever backend you use to know how to make it generate one for you.
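If your backend does not generate a sitemap for you, writing one by hand is not hard either. Below is a minimal sketch in Python that builds one from a list of pages. The domain, the three pages, and their priorities are placeholders I made up for illustration; the element names and the namespace follow the sitemaps.org format.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

# Placeholder pages: (URL, change frequency, priority between 0.0 and 1.0)
PAGES = [
    ("https://your-domain.tld/", "weekly", 1.0),
    ("https://your-domain.tld/blog", "daily", 0.8),
    ("https://your-domain.tld/cv", "yearly", 0.5),
]


def build_sitemap(pages) -> str:
    """Return the contents of a sitemap.xml for the given pages."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, changefreq, priority in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "changefreq").text = changefreq
        ET.SubElement(url, "priority").text = f"{priority:.1f}"
    body = ET.tostring(urlset, encoding="unicode")
    # Prepend the XML declaration ourselves to keep it predictable.
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


sitemap_xml = build_sitemap(PAGES)
print(sitemap_xml)
```

You would then save the result as sitemap.xml at the root of your website and point to it from your robots.txt.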

One final piece of advice, though: Both robots.txt and sitemap.xml should mostly be considered suggestions. There are no legal rules forcing crawlers to respect the robots.txt. Similarly, crawlers can recognize the sitemap.xml, or fully ignore it and go sniffing around. So if you have things that you don’t want them to see, simply don’t put them online. Don’t rely on those files to actually constrain any automated system that wants to index your website. I do suggest adding those files, specifically for those bots that do read them in and follow them; there’s just no guarantee. Google for example will recognize them, but apparently chooses to ignore the change frequency and priority of the sitemap.

Making Yourself Machine Readable with Structured Data

In the following sections, I will introduce a few common systems to make your website machine-readable. But before doing so, it makes sense to talk a bit about the conceptual side of things. Specifically, the first question we need to tackle is: what can you actually make machine-readable?

Your website as it is will already be machine-readable. The HTML code, the CSS, and the JavaScript that make up your website are designed to be consumed by a machine – more specifically, your browser. Your browser will read in the HTML code, apply some CSS styling, and make the website dynamic with JavaScript. But this is not what you want to make machine-readable.

What you want to make machine-readable is what is not obvious from simply looking at the HTML code of your website. HTML can represent anything you want, but a machine will have difficulty deciding whether what it is looking at is a blog post, a CV, or a personal website like yours. From a machine standpoint, all websites look the same, because they all use HTML code. So to help a machine decide what it is it’s looking at, you’ll want to add a bit of metadata to each of your website’s sub-pages to make that clear.

The systems I am going to introduce enable you to do just that. Be it JSON-LD, Dublin Core, or OpenGraph: These systems are used to describe things in a structured manner and put them in a relationship with each other. There are systems that are used by search engines, by social media, or reference managers in order to understand what it is they are looking at.

Let us very briefly talk about one system that is closest to academia: citations and publications. If you use Zotero, you will probably have used the Browser Connector to save some journal article down to your collection. What the connector does is almost magically extract the correct metadata for those journal articles – authors, title, abstract, publication, date, and so on. But how does it do that? It uses two systems of describing publications in a machine-readable way: Dublin Core and Highwire Press. Those are specific pieces of information that it will look for in every website. This is precisely how Zotero knows whether you currently have a journal article open, or the New York Times: because all of these websites make use of these metadata tags to describe the content on their website.

While you will probably not host an entire journal on your website, there is at least one, maybe two different things that you will need on your website. The first one is you yourself. A machine will not know that it is a website that describes a single individual, so you need to tell it. Similarly, if you decide to start a blog, you will have another type of thing to describe: blog posts. So when a machine opens your website, it should only see a structured description of yourself, but as soon as it opens a blog post, it should additionally see data to describe that blog post.

This is what these various systems allow you to do: describe what things are present on a website and how that information should be interpreted. You absolutely do not have to make use of everything I describe in this article, but you should at least use the metadata to describe your person.

With that out of the way, let’s dive into the depths of Dublin Core, Highwire Press, JSON-LD, and OpenGraph.

Making Yourself Machine Readable with JSON-LD

JSON-LD is short for “JavaScript Object Notation for Linked Data”, and it is a machine-readable structure that links data about you in JSON format (read more on JSON here). This is a data structure that will be respected by many web crawlers and can help them link your various social media profiles back to your website.

Remember that the internet is a gigantic graph, and all that matters are links between various parts of the internet. I already alluded to that in a previous article when I said that machines are too dumb and humans too lazy to copy your Twitter handle and paste it in the app. When you actually link your Twitter account, both machines and humans will be more likely to actually follow that link.

But machines still need a bit more care. The thing is, you can link several different Twitter accounts on your website, and while we humans can easily see which one is your personal one, a machine cannot. JSON-LD really makes it obvious to web crawlers which links between different parts of the internet exist. With JSON-LD you can show which social media profiles are yours, which organization you are a part of, and which side projects belong to you.

You will need to put some time into JSON-LD, but the basic steps are simple once you have your own object:

1. Create the JSON-LD for yourself
2. Insert that into your website’s HTML head

I will walk you through step one here; step two depends heavily on the backend that you’ve chosen. For a static site generator, this often involves creating a template file and linking that at the appropriate place. Since this is something similar to actual JavaScript, you can follow the same steps as adding additional JavaScript to your website, so follow the instructions for that. You can have a look at the source code for my website to see where in the page’s HTML it should appear.

But now let’s have a look at my own JSON-LD. Below I have copied the entire thing and commented it to explain what each property does. (Note that JSON does not support comments, so remember to not use them in your actual JSON-LD!)

<script type="application/ld+json">
{
// The @context is required
"@context": "https://schema.org",
// I am a person; but you can also create JSON-LD for a company, or a CollegeOrUniversity (see below)
"@type": "Person",
// This should be your website; use the full address
"@id": "https://www.hendrik-erz.de/",
// Your name(s)
"name": "Hendrik Erz",
// Interestingly, alternateName seems to be required, so you can just duplicate your name here.
"alternateName": "Hendrik Erz",
// I hope that one is obvious.
"nationality": "German",
// Unfortunately, I don't have any awards yet, but you can add yours here
"award": [],
// Affiliation is exactly that: A list of institutions you are affiliated with. Remember to update that as soon as you switch universities!
"affiliation": [
	{
		"@type": "CollegeOrUniversity",
		"name": "Institute for Analytical Sociology, Linköping University, Sweden",
        // The sameAs property always contains absolute URLs to various pages that are the same thing. Here I have added our Twitter Account as well.
		"sameAs": [
			"https://liu.se/en/organisation/liu/iei/ias",
			"https://twitter.com/IAS_LiU"
		]
  	}
],
// I hope this is self-explanatory -- it works just like affiliation.
"alumniOf": [
	{
	 "@type": "CollegeOrUniversity",
	 "name": "University of Bonn",
	 "sameAs": "https://en.wikipedia.org/wiki/University_of_Bonn"
	}
],
// Leave that out if you prefer not to say
"gender": "Male",
// The next two should describe your main role in a very broad and specific sense.
"description": "Researcher",
"disambiguatingDescription": "PhD-Candidate in Analytical Sociology/Computational Social Science",
"jobTitle": "PhD-Student",
// For us PhD students, this should be the same as affiliation, but you can add any side hustle here
"worksFor": [
	{
		"@type": "CollegeOrUniversity",
		"name": "Institute for Analytical Sociology, Linköping University, Sweden",
		"sameAs": [
			"https://liu.se/en/organisation/liu/iei/ias",
			"https://twitter.com/IAS_LiU"
		]
  	}
],
"url": "https://www.hendrik-erz.de/",
// I just saw that I should probably provide an image of me here (again, this should be an absolute URL)
"image": "",
// This is where the actual spice lies: Here you can link all different URLs to various social media to indicate that these are in fact your own things.
"sameAs": [
	"https://twitter.com/sahiralsaid", // My Twitter
	"https://instagram.com/nathan_lesage", // My Instagram
	"https://www.linkedin.com/in/hendrik-erz/", // My LinkedIn
	"https://github.com/nathanlesage", // My GitHub
	"https://liu.se/en/employee/hener68", // My Staff page
	"https://scholar.google.com/citations?user=L8y-sWQAAAAJ", // Google Scholar
	"https://bsky.app/profile/hendrikerz.bsky.social" // Bluesky
	]
}
</script>

Of course, there are many more properties that you can use to describe yourself; these are described here. Use what you want, and leave out what doesn’t apply to you. Also, if you specifically want your website to stick out a bit from Google search results, here are some examples for how to do so.

There are also some generators out there that will generate parts of or the entire spaghetti code automatically, for example this one.
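If you would like to keep explanatory comments next to your data, one option is to keep the commented version in a small script and let that script produce the comment-free JSON-LD. Here is a minimal sketch of that idea in Python. The name, domain, and profile URLs are placeholders, and the handful of properties is only a small subset of what I use above.

```python
import json


def person_jsonld(name: str, url: str, same_as: list[str]) -> str:
    """Return a <script> tag with JSON-LD describing a single person."""
    data = {
        # The @context tells consumers which vocabulary is used
        "@context": "https://schema.org",
        "@type": "Person",
        # Use the full address of your website here
        "@id": url,
        "name": name,
        "url": url,
        # Profiles elsewhere on the web that are also you
        "sameAs": list(same_as),
    }
    payload = json.dumps(data, indent=2, ensure_ascii=False)
    # Make sure no value can accidentally close the <script> tag;
    # "\/" is a valid JSON escape for "/".
    payload = payload.replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{payload}\n</script>'


# All values below are placeholders.
snippet = person_jsonld(
    name="Jane Doe",
    url="https://your-domain.tld/",
    same_as=[
        "https://github.com/your-handle",
        "https://www.linkedin.com/in/your-handle/",
    ],
)
print(snippet)
```

The comments live in the script, while the output is plain JSON that you can paste into your template's HTML head.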

Showing Off on Social Media With OpenGraph

Now that you’ve told search engines what they can index and what not, and told Google a bit more about you, it is time to do the same for social media.

While JSON-LD is great for generally telling machines how you relate around the web, did you know that you can also help social media make your website look better? When you share a link on Twitter, you will see that this expands to what is called a “card” in Twitter lingo. This usually consists of a title, a short synopsis of the link, and a preview image. You can actually control what these are for your own website!

When you don’t add the corresponding tags to your website, your website will look a bit pale because social media platforms will use some basic defaults. But with a few simple HTML tags, you can make your website look much better – regardless of who links your website. Just like with JSON-LD, I’m simply going to show what my website contains and explain what the various things do. Note the various {{ something }}-tags in there: These are template tags that my CMS uses. For static website generators, they will look similar, for other CMS they may look different.

<!-- This is a title specific for twitter. Can be the same or different than the <title> tag. -->
<meta name="twitter:title" content="{{ this.page.title }}">

<!-- Here I check whether there is a dedicated synopsis for the given page (e.g., the abstract for a blog post) ... -->
{% if this.page.excerpt %}
<meta name="twitter:description" content="{{ this.page.excerpt }}">
{% else %}
<!-- ... or not. If not, then I will use the generic meta description for Twitter. -->
<meta name="twitter:description" content="{{ this.page.meta_description }}">
{% endif %}

<!-- These two tags are very similar to the JSON-LD data I described above: They contain my Twitter handle. -->
<meta name="twitter:site" content="@sahiralsaid">
<meta name="twitter:creator" content="@sahiralsaid">

<!-- This tells Twitter how to display this website if it is linked on Twitter. For my website always summary_large_image, because I like how it looks. -->
<meta name="twitter:card" content="summary_large_image">

<!-- This is the large preview image that you can provide so that your links have a nice image attached to them. -->
<!-- I don't use specific images for every blog post, so for my website, this is always the same, no matter which article one links. -->
<!-- It should be 1600x900 pixels. -->
<meta name="twitter:image" content="https://www.hendrik-erz.de/storage/app/media/hendrikerzde_socialmedia.png">

<!-- The following are Open Graph directives. This is almost the same as the Twitter-properties, but for Facebook. -->
<!-- If you want your links to always look good, make sure to include both. Many other websites such as Mastodon -->
<!-- or Bluesky will use one of these as well to preview your content. -->
<meta property="og:locale" content="en_US">
<meta property="og:type" content="article">
<meta property="og:title" content="{{ this.page.title }}">
<!-- Same description if/else as above -->
{% if this.page.excerpt %}
<meta property="og:description" content="{{ this.page.excerpt }}">
{% else %}
<meta property="og:description" content="{{ this.page.meta_description }}">
{% endif %}
<meta property="og:url" content="{{ ''|page }}">
<meta property="og:site_name" content="Hendrik Erz">
<meta property="og:image" content="https://www.hendrik-erz.de/storage/app/media/hendrikerzde_socialmedia.png">

You can find all the various pieces of metadata that social media platforms will look for here (Twitter) and here (OpenGraph). As you can see, while all of your pages should include that information, most of it will stay the same for most of your subpages. For a really static personal website, you can probably write some template file for that specific piece of information, and call it a day. Only if you have a blog as I do should you add a bit more spice to it.
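If your backend has no template language, you can also generate these tags with a few lines of code. The following is a rough sketch in Python, not what my CMS does: the function name, the example values, and the Twitter handle are all made up. The one thing it shows that is easy to forget is that titles and descriptions must be escaped before they go into an HTML attribute.

```python
from html import escape


def social_meta_tags(title, description, url, image, twitter_handle=None):
    """Return Twitter and OpenGraph <meta> tags as one HTML string."""
    tags = [
        # (attribute name, key, value); Twitter uses name=, OpenGraph property=
        ("name", "twitter:card", "summary_large_image"),
        ("name", "twitter:title", title),
        ("name", "twitter:description", description),
        ("name", "twitter:image", image),
        ("property", "og:type", "article"),
        ("property", "og:title", title),
        ("property", "og:description", description),
        ("property", "og:url", url),
        ("property", "og:image", image),
    ]
    if twitter_handle:
        tags.append(("name", "twitter:creator", twitter_handle))
    return "\n".join(
        # escape() keeps quotes and ampersands from breaking the attribute
        f'<meta {attr}="{key}" content="{escape(value, quote=True)}">'
        for attr, key, value in tags
    )


# Placeholder values for illustration only.
meta_html = social_meta_tags(
    title='Notes on "SEO" & metadata',
    description="A short synopsis of the page.",
    url="https://your-domain.tld/post/notes",
    image="https://your-domain.tld/media/preview.png",
    twitter_handle="@your_handle",
)
print(meta_html)
```

Generating both sets of tags from the same values also makes sure the Twitter and OpenGraph previews never drift apart.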

Which leads me to the next section:

Describing Blog Posts and Other Citeable Things With Dublin Core and Highwire Press

When I included the OpenGraph and Twitter metadata above, I left out the second half of the file, which contains the metadata directives for my blog articles. While you could in principle use JSON-LD or OpenGraph to describe those as well, if you want to make your blog citeable by, e.g., Zotero, you will want to use yet another system. If you are frustrated about why there are so many different systems that effectively do the same thing, let me again link the relevant XKCD.

Again, here’s the code with explanatory comments:

<!-- All of the following is only generated by my website if the page actually contains a blog post, which in my case I check with "if post" -->
{% if post %}
  <!-- The following are Dublin Core Specifications -->
  <meta name="DC.Title" content="{{ post.title }}">
  <meta name="DC.Creator" content="Hendrik Erz">
  <!-- If you make use of keywords/tags in your blog posts, you can put them here instead of something hard coded. -->
  <meta name="DC.Subject" content="Text Analysis; Machine Learning; Sociology; Programming; Web Development; Computational Social Science">
  <meta name="DC.Description" content="{{ post.excerpt }}">
  <meta name="DC.Date" content="{{ post.published_at | date('Y-m-d') }}">
  <meta name="DC.Type" content="Text">
  <meta name="DC.Format" content="text/html">
  <meta name="DC.Identifier" content="{{ post.url }}">
  <meta name="DC.Source" content="Hendrik Erz, Personal Website">
  <meta name="DC.Language" content="en-US">
  <!-- This one here really makes it hard not to think of JSON-LD -->
  <meta name="DC.Relation" content="https://www.hendrik-erz.de/" scheme="IsPartOf">

  <!-- While Zotero supports many DC properties, I have found that Zotero seems to be smarter when it comes -->
  <!-- to a different type of properties called "Highwire Press". -->
  <!-- By including these properties, the Browser Connector will change its icon to something that conveys -->
  <!-- "Hey, this is citeable!" and ensure that Zotero will save the correct metadata. -->
  <meta name="citation_title" content="{{ post.title }}">
  <meta name="citation_author" content="Hendrik Erz">
  <meta name="citation_publication_date" content="{{ post.published_at | date('Y-m-d') }}">
  <meta name="citation_public_url" content="{{ post.url }}" />
{% endif %}

I think this is one of the coolest things that you are able to do with metadata properties on your website: Make Zotero and other citation engines properly pick up your blog posts as citeable publications. The combination of Dublin Core specifications and Highwire Press tags will even be picked up by Google, which will then offer your blog posts as actual publications in Google Scholar! Note, however, that while Dublin Core is a properly described system for adding publication metadata to your website, most systems curiously enough will often only pick up Highwire Press tags. Interestingly enough, there is no definitive resource for which tags exist, as this StackOverflow thread reports (but it contains a few dozen tags that are apparently valid).

You can read more about the various ways of describing digital resources on this website that I just found.
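To check which citation metadata one of your pages actually exposes, you can run its HTML through a small parser. The sketch below uses Python's built-in HTML parser and simply collects every Dublin Core and Highwire Press tag it finds. It only approximates what a tool like Zotero might look at (I have not checked how Zotero does it internally), and the URL in the sample is a placeholder.

```python
from html.parser import HTMLParser


class CitationMetaParser(HTMLParser):
    """Collect DC.* and citation_* <meta> tags from an HTML document."""

    def __init__(self):
        super().__init__()
        self.found = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name") or ""
        if name.startswith("citation_") or name.lower().startswith("dc."):
            # A tag may repeat (e.g. several authors), so keep a list
            self.found.setdefault(name, []).append(attributes.get("content") or "")


def extract_citation_meta(html_text: str) -> dict:
    parser = CitationMetaParser()
    parser.feed(html_text)
    parser.close()
    return parser.found


SAMPLE_HTML = """
<head>
  <meta name="DC.Title" content="Academic Website V: Telling Machines Who You Are">
  <meta name="citation_title" content="Academic Website V: Telling Machines Who You Are">
  <meta name="citation_author" content="Hendrik Erz">
  <meta name="citation_publication_date" content="2023-12-22">
  <meta name="citation_public_url" content="https://your-domain.tld/post" />
  <meta property="og:title" content="Not citation metadata">
</head>
"""

found = extract_citation_meta(SAMPLE_HTML)
for name, values in found.items():
    print(name, "->", values)
```

If a tag you expect is missing from the output, the problem is most likely in your template and not in Zotero.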

Further Considerations

Besides these five fundamental ways of making your website machine-readable that I personally think have the biggest impact on your reach and visibility, there are millions of other considerations. A few pieces of knowledge I found personally helpful are the following.

When linking to companies or websites that you don’t owe anything, include the attribute rel="nofollow" to tell search engines that you don’t want to vouch for those links. Search engines may treat this as a hint rather than a strict command, but it keeps your site from lending its reputation to pages you don’t want to endorse. Conversely, be generous in linking to colleagues or websites that are niche but very cool: Every link both fosters the search engines’ association of you with that link and helps the linked pages. Of course, in those cases, don’t include rel="nofollow".
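If you have already written a few pages, it can be handy to list all external links and see which of them carry rel="nofollow". Here is a small sketch in Python that does this for a piece of HTML. The class name, the sample links, and the domain are invented for the example; deciding which links deserve the attribute is still up to you.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkAuditor(HTMLParser):
    """List external links and whether they are marked rel="nofollow"."""

    def __init__(self, own_host: str):
        super().__init__()
        self.own_host = own_host
        self.external_links = []  # tuples of (href, has_nofollow)

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        href = attributes.get("href") or ""
        host = urlparse(href).netloc
        # Relative links and links to your own domain are internal
        if not host or host == self.own_host:
            return
        # rel can hold several space-separated values
        rel_tokens = (attributes.get("rel") or "").lower().split()
        self.external_links.append((href, "nofollow" in rel_tokens))


SAMPLE_HTML = """
<p>
  <a href="/blog">My blog</a>
  <a href="https://your-domain.tld/cv">My CV</a>
  <a href="https://colleague.example/">A colleague</a>
  <a href="https://big-company.example/" rel="nofollow noopener">Some company</a>
</p>
"""

auditor = LinkAuditor(own_host="your-domain.tld")
auditor.feed(SAMPLE_HTML)
auditor.close()

for href, has_nofollow in auditor.external_links:
    print(href, "nofollow" if has_nofollow else "followed")
```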

Then, especially when you modify the HTML of your website, make sure to use semantic HTML. This will help search engines, but it will also make your website accessible to disabled people, because screen readers can distinguish a navigation bar in a <nav>-tag from the actual article, <article>. This also ensures that people can read your website in the dedicated reading mode that Firefox and Safari have: The reading mode in these browsers will hide everything outside the <article>-tag.

Besides these, you should absolutely ensure that your website is accessible (a11y for short) and “mobile ready”. I haven’t explicitly covered these in this section because (a) this would’ve made the entire thing even longer, and (b) most pre-defined themes already account for that. So in an ideal world, simply by using a pre-existing theme, you will already have an accessible and mobile ready website.

Conclusion

There are many more things to consider if you want to really go nuts with SEO, but I believe that you don’t have to implement every new trend by some marketing agency. I personally think that those SEO elements that I presented in this article are sufficient to already make your website appear and function professionally.

Now we are nearing the end of this article series, but there are two additional things I want to pass on to you before I let you explore the world of web design: A few notes on keeping your website safe and up to date, and what to consider if you want to move around with your website. Finally, I want to close with a few tips & tricks that I have found really cool over the past 20 years of making websites.

So, until then, see you next time!
