Twtxt UserAgent log parser for common web server access logs to identify who/what is following your feed(s)
James Mills cc9266e4a5
Add LICENSE
6 months ago
.gitignore Hard fork 6 months ago
LICENSE Add LICENSE 6 months ago
Makefile Hard fork 6 months ago
README.md Hard fork 6 months ago
go.mod Hard fork 6 months ago
go.sum Hard fork 6 months ago
main.go Hard fork 6 months ago
useragent.go Hard fork 6 months ago
useragent_test.go Hard fork 6 months ago

README.md

useragent

useragent is a Twtxt User-Agent HTTP request header analyzer, which helps discovering new people to follow or check whether certain people are able to receive mentions or not.

It reads an Nginx access log file from stdin and writes simple statistics to stdout. I use the following Nginx config for my twtxt.txt file:

log_format twtxt '$time_iso8601 "$request" $status "$http_user_agent"';
server {
    …
    location = /twtxt.txt {
        access_log /somewhere/twtxt.log twtxt;
    }
    …
}

So the access log file looks like:

2022-01-12T22:39:42+01:00 "GET /twtxt.txt HTTP/1.1" 304 "twtxt/1.2.3 (+https://example.com/twtxt.txt; @somebody)"
2022-01-12T22:40:07+01:00 "GET /twtxt.txt HTTP/1.1" 304 "jenny/latest (+https://example.org/peter.txt; @peter)"

That said, useragent should work fine with any access log format in which the User-Agent is logged in double quotes at the very end of each line. The web server must also escape or encode any double quotes inside the header value (Nginx hex-encodes double quotes as \x22).
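
The requirement above can be illustrated with a minimal sketch that pulls the last double-quoted field out of a log line. The function name extractUserAgent is hypothetical and not taken from useragent's source; the sketch assumes the server has already escaped any embedded double quotes.

```go
package main

import (
	"fmt"
	"strings"
)

// extractUserAgent returns the content of the last double-quoted field of an
// access log line. It assumes the User-Agent is the final field and that the
// server escaped embedded double quotes (Nginx writes them as \x22).
// Hypothetical helper, not part of useragent itself.
func extractUserAgent(line string) (string, bool) {
	line = strings.TrimRight(line, " \t\r\n")
	if !strings.HasSuffix(line, `"`) {
		return "", false
	}
	line = line[:len(line)-1] // drop the closing quote
	start := strings.LastIndex(line, `"`)
	if start < 0 {
		return "", false
	}
	return line[start+1:], true
}

func main() {
	line := `2022-01-12T22:40:07+01:00 "GET /twtxt.txt HTTP/1.1" 304 "jenny/latest (+https://example.org/peter.txt; @peter)"`
	if ua, ok := extractUserAgent(line); ok {
		fmt.Println(ua)
	}
}
```

Searching from the end of the line means the fields before the User-Agent can have any layout, which is why other log formats should work as well.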

Supported User-Agent Formats

  • Official single user client format, e.g. twtxt/1.2.3 (+https://example.com/twtxt.txt; @somebody), both semicolon and comma are recognized separators
  • Extended multi user client format, e.g. twtxt/0.1.0@abcdefg (~https://example.com/whoFollows?token=randomtoken123; contact=https://example.com/support)
  • Old twtd multi user client format with 2-5 followers, e.g. twtxt/0.1.0@69ac73b (Pod: example.com Followers: hugo kate Support: https://example.com/support)
  • Old twtd multi user client format with 6 or more followers, e.g. twtxt/0.1.0@37fd365 (Pod: example.com Followers: eugen hugo kate lieschen richard and 3 more... https://example.com/whoFollows?uri=https://example.org/twtxt.txt&nick=steffi&token=OzcdPbe6Z Support: https://example.com/support)
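
As an illustration of the first format in the list above, the following sketch parses a single user User-Agent with a regular expression. The pattern and the name parseSingleUser are assumptions made for this example; useragent's actual matching rules may differ.

```go
package main

import (
	"fmt"
	"regexp"
)

// singleUserUA matches "<client>/<version> (+<url>; @<nick>)".
// Both ';' and ',' are accepted as the separator.
// The pattern is an assumption for illustration only.
var singleUserUA = regexp.MustCompile(`^([^/\s]+)/(\S+) \(\+(\S+?)[;,]\s*@(\S+)\)$`)

// parseSingleUser returns the nick and feed URL of a single user User-Agent.
func parseSingleUser(ua string) (nick, feedURL string, ok bool) {
	m := singleUserUA.FindStringSubmatch(ua)
	if m == nil {
		return "", "", false
	}
	return m[4], m[3], true
}

func main() {
	nick, feedURL, ok := parseSingleUser("twtxt/1.2.3 (+https://example.com/twtxt.txt; @somebody)")
	fmt.Println(nick, feedURL, ok)
}
```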

For both old twtd formats, all followers found directly in the User-Agent header are extracted, and their twtxt.txt URLs are constructed from the support URL or the Who Follows Resource URL.
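
A rough sketch of this extraction for the old twtd formats follows. It assumes the feed path layout /user/<nick>/twtxt.txt seen in the example output further down, takes scheme and host from the support URL, and expects the text between the parentheses as input. The name followerFeeds is hypothetical, and a follower literally called "and" would confuse this simplified version.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
	"strings"
)

// followerFeeds maps each nick listed after "Followers:" to a feed URL built
// from the host of the support URL. The "/user/<nick>/twtxt.txt" layout is an
// assumption based on the example output, not a documented rule.
func followerFeeds(details string) (map[string]string, error) {
	_, rest, ok := strings.Cut(details, "Followers: ")
	if !ok {
		return nil, errors.New("no Followers section")
	}
	list, support, ok := strings.Cut(rest, " Support: ")
	if !ok {
		return nil, errors.New("no Support URL")
	}
	// With 6 or more followers the list ends in "and N more... <URL>".
	list, _, _ = strings.Cut(list, " and ")
	base, err := url.Parse(strings.TrimSpace(support))
	if err != nil {
		return nil, err
	}
	feeds := make(map[string]string)
	for _, nick := range strings.Fields(list) {
		feed := url.URL{Scheme: base.Scheme, Host: base.Host, Path: "/user/" + nick + "/twtxt.txt"}
		feeds[nick] = feed.String()
	}
	return feeds, nil
}

func main() {
	feeds, err := followerFeeds("Pod: example.com Followers: hugo kate Support: https://example.com/support")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(feeds)
}
```

strings.Cut requires Go 1.18 or newer.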

The Who Follows Resources are not queried by default. Instead, the most recent URL for each encountered hostname is printed (assuming new log records are always appended to the access log). Operators can either query their followers manually or use the -r flag to resolve them automatically.
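
Keeping only the most recent URL per hostname can be sketched with a map that later log records overwrite. The function latestPerHost and the sample URLs are hypothetical; the sketch relies on the same assumption that records arrive in log order.

```go
package main

import (
	"fmt"
	"net/url"
)

// latestPerHost keeps the last seen Who Follows Resource URL per hostname.
// Input order is assumed to be log order, so later entries win.
// Hypothetical helper, not part of useragent itself.
func latestPerHost(urls []string) map[string]string {
	latest := make(map[string]string)
	for _, raw := range urls {
		u, err := url.Parse(raw)
		if err != nil || u.Hostname() == "" {
			continue // skip anything that is not an absolute URL
		}
		latest[u.Hostname()] = raw
	}
	return latest
}

func main() {
	seen := []string{
		"http://example.com/whoFollows?followers=7&token=older",
		"http://example.org/whoFollows?followers=1&token=other",
		"http://example.com/whoFollows?followers=8&token=gLvOWbFYT",
	}
	for host, u := range latestPerHost(seen) {
		fmt.Println(host, u)
	}
}
```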

Example Usage

$ useragent < /somewhere/twtxt.log
Twtxt UAs: 16841  Non-Twtxt UAs: 1709
  343 @kate → http://example.com/user/kate/twtxt.txt
 4309 @eugen → http://example.com/user/eugen/twtxt.txt
   34 @lieschen → http://example.com/user/lieschen/twtxt.txt
   34 @hugo → http://example.com/user/hugo/twtxt.txt
 9902 @richard → http://example.com/user/richard/twtxt.txt
  983 @peter → https://example.org/peter.txt
32900 @somebody → https://example.com/twtxt.txt
   34 http://example.com/whoFollows?followers=8&token=gLvOWbFYT

Analyzing Unknown User-Agents

The -u or --show-unknown flag additionally prints all unknown User-Agents in raw form, together with their number of occurrences, to stdout. This helps with tuning the program further and adding support for new twtxt formats.

The -c or --classify-unknown flag classifies and prints all the unknown User-Agents with their occurrences to stdout. It uses a very simple built-in mapping table for Twtxt clients, bots, libraries and browsers. Please note: this mapping is far from complete and may even produce incorrect classifications. This flag implies -u/--show-unknown.

The -g or --group-unknown flag groups and prints all the classified unknown User-Agents with their occurrences to stdout. This is similar to -c/--classify-unknown, except that the output is even more condensed and only the groups of the classified agents are shown. Groups are Twtxt client, Bots, Libraries, Browsers, - (if empty) and (???) (for unmapped agents). This flag implies both -u/--show-unknown and -c/--classify-unknown.

Resolving and Fetching Who Follows Resources

The -r or --resolve-who-follows flag attempts to resolve nicks and feed URLs from all the encountered Who Follows Resources. Instead of the Who Follows Resource URLs, the nicks and feed URLs are printed together with the number of hits.

The default timeout for resolving Who Follows Resources is 1s. Use -t or --timeout to raise or lower it. This flag accepts everything that Go's time.ParseDuration(…) function recognizes:

A duration string is a possibly signed sequence of decimal numbers, each with optional fraction and a unit suffix, such as 300ms, -1.5h or 2h45m. Valid time units are ns, us (or µs), ms, s, m, h.

-- https://pkg.go.dev/time#ParseDuration
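
To show which values such a timeout accepts, here is a small sketch that turns a duration string into an HTTP client timeout. The helper newClient is hypothetical; how useragent wires its -t flag internally is not shown here.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// newClient builds an HTTP client whose timeout comes from a duration string
// in the syntax accepted by time.ParseDuration.
// Hypothetical helper, not part of useragent itself.
func newClient(timeout string) (*http.Client, error) {
	d, err := time.ParseDuration(timeout)
	if err != nil {
		return nil, fmt.Errorf("invalid timeout %q: %w", timeout, err)
	}
	return &http.Client{Timeout: d}, nil
}

func main() {
	// "10" is rejected because it lacks a unit suffix.
	for _, s := range []string{"1s", "300ms", "2h45m", "10"} {
		c, err := newClient(s)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(s, "->", c.Timeout)
	}
}
```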

Please note that yarnd's multi user User-Agents contain a token that is invalidated once it has been used. Successive calls to the same Who Follows Resource will therefore fail until a new feed fetch with a new token gets logged. useragent neither keeps track of used URLs nor caches the resolved nicks and feed URLs for future reuse.

Supported Who Follows Resource Formats

  • The official Who Follows Resource JSON format, e.g. {"somebody": "https://example.com/user/somebody/twtxt.txt"}
  • Yarnd replied with a buggy JSON array of objects format for quite some time, e.g. [{"Nick": "somebody", "URL": "https://example.com/user/somebody/twtxt.txt", "LastFetchedAt": "2022-01-02T23:04:22.208290302Z"}]
  • Yarnd even sent empty URL but filled URI fields, e.g. [{"URI": "https://example.com/user/somebody/twtxt.txt", "URL": "", "Nick": "somebody", "LastSeenAt": "2022-01-03T08:54:36.638147948Z"}]
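
The three response shapes listed above can be handled by one decoder that first tries the official object form and then falls back to the array form, preferring URL over URI. This is a minimal sketch under those assumptions; the names follower and decodeWhoFollows are hypothetical and not taken from useragent's source.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// follower is a resolved nick with its feed URL (hypothetical type).
type follower struct {
	Nick string
	URL  string
}

// decodeWhoFollows accepts the official {"nick": "url"} object as well as the
// array-of-objects variants, falling back to URI when URL is empty.
func decodeWhoFollows(data []byte) ([]follower, error) {
	var followers []follower

	var official map[string]string
	if err := json.Unmarshal(data, &official); err == nil {
		for nick, feedURL := range official {
			followers = append(followers, follower{Nick: nick, URL: feedURL})
		}
	} else {
		var entries []struct {
			Nick string
			URL  string
			URI  string
		}
		if err := json.Unmarshal(data, &entries); err != nil {
			return nil, err
		}
		for _, e := range entries {
			feedURL := e.URL
			if feedURL == "" {
				feedURL = e.URI
			}
			followers = append(followers, follower{Nick: e.Nick, URL: feedURL})
		}
	}

	// Sort by nick so the result does not depend on map iteration order.
	sort.Slice(followers, func(i, j int) bool { return followers[i].Nick < followers[j].Nick })
	return followers, nil
}

func main() {
	data := []byte(`[{"URI": "https://example.com/user/somebody/twtxt.txt", "URL": "", "Nick": "somebody", "LastSeenAt": "2022-01-03T08:54:36.638147948Z"}]`)
	followers, err := decodeWhoFollows(data)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(followers)
}
```

Unknown fields such as LastFetchedAt and LastSeenAt are ignored by encoding/json, so they need no special handling.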

License

useragent is licensed under the terms of the AGPLv3 and the code was originally hard-forked and borrowed (with permission) from @lyse@lyse.isobeef.org's tt Twtxt client.