Gitea: stop search indexer

2 minute read

After ages, I paid a visit to the Google Search Console, just for fun. Normally I’m not interested in it because I don’t use Google myself and assumed that search engines would do their job well out of self-interest.

I was confused when I realized that the majority of my pages had not even made it into the index.

The problem

How are people supposed to find my website? Even if they knew the right search terms, they would hardly see anything on Google. After all, the console clearly stated that none of my blog posts had made it into the index. For my Gitea instance, on the other hand, several thousand pages were listed: a clear imbalance. And why does Gitea generate such a high volume of pages in the first place?

Image: Google Search Console view for git.schallbert.de. Thousands of pages are crawled that a real user will never be interested in.

Google’s crawler (a program that searches the web for content and makes it indexable for search) apparently finds every commit, no matter how small, in public repositories as well as the files behind them. Automation runs and other metadata are also indexed.

This is of course completely unnecessary and wastes crawling effort in a place where I would much rather see it spent on the presentation of my blog. So a solution was needed.

Possible solutions

A few possibilities immediately occurred to me:

  • Configure the crawler for each repository so that only the top folder level is searched.
  • Completely switch off search indexing for git.schallbert.de.
  • Use the Search Console to request corrections until the index correctly lists my blog posts.

The most obvious solution for me was to switch off indexing for my Gitea instance entirely. Correction loops through the Search Console only take effect days or weeks later with Google’s crawler and would have been too time-consuming, and I deemed a separate set of robots.txt rules for each repository too complicated; a rough sketch of what that would have meant follows below.
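
For completeness, such a per-repository rule set would probably have looked something like this. The commit, raw, and actions paths are my guess at Gitea’s URL scheme, and wildcard patterns are not honoured by every crawler:

User-agent: *
Disallow: /*/*/commit/
Disallow: /*/*/commits/
Disallow: /*/*/raw/
Disallow: /*/*/actions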

My implementation

I read the Gitea documentation. The suggestion of a robots.txt made sense to me immediately, as did its contents:

User-agent: *
Disallow: /

This tells crawlers of any origin (*) that nothing should be indexed from the root path / downwards.

Unfortunately, I had no idea how Gitea makes this file available in its instance:

To make Gitea serve a custom robots.txt (default: empty 404) for top level installations, create a file with path public/robots.txt in the custom folder or CustomPath.
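
Rendered as a folder tree, and assuming the custom folder of a containerized install sits at /data/gitea (my reading of the setup, not verified for every deployment), the documented location would be:

/data/gitea/
└── public/
    └── robots.txt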

Gitea’s configuration directory

So I experimented a bit. I created a public folder with the corresponding file, then a custom folder, at different levels within the Gitea folder structure. Each time, I restarted the container and checked whether the robots file now appeared on the server.

Image: Where I put robots.txt in Gitea's folder tree.

Nothing.

But then I dropped the extra folder names and simply put the file directly into the Gitea configuration directory, the lowest level at which gitea no longer appears as a folder name. And that was the solution.
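
For reference, a minimal sketch of what worked for me. The container name gitea and the data path /data/gitea are assumptions based on the official Docker image and may differ for other installations:

# copy the file straight into the configuration directory, without a public/ or custom/ subfolder
docker cp robots.txt gitea:/data/gitea/robots.txt
docker restart gitea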

Test

After restarting, I was able to successfully display the file in the browser.
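
For a quick check without a browser, the same test can be run with curl (assuming it is available on the client); it should return the two rules above with an HTTP 200 status:

curl -i https://git.schallbert.de/robots.txt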

Image: Browser successfully loaded my robots.txt for the Gitea instance.

In the meantime, Google has also started to remove the unwanted pages from the index. My blog articles will hopefully be available in Google search over the next few weeks as well.