useragent
useragent is a Twtxt User-Agent HTTP request header analyzer, which helps to discover new people to follow or to check whether certain people are able to receive mentions.
It reads an Nginx access log file from stdin and writes simple statistics to stdout. I use the following Nginx config for my twtxt.txt file:
log_format twtxt '$time_iso8601 "$request" $status "$http_user_agent"';
server {
    …
    location = /twtxt.txt {
        access_log /somewhere/twtxt.log twtxt;
    }
    …
}
So the access log file looks like:
2022-01-12T22:39:42+01:00 "GET /twtxt.txt HTTP/1.1" 304 "twtxt/1.2.3 (+https://example.com/twtxt.txt; @somebody)"
2022-01-12T22:40:07+01:00 "GET /twtxt.txt HTTP/1.1" 304 "jenny/latest (+https://example.org/peter.txt; @peter)"
However, useragent should work fine with any access log format where the User-Agent is logged in double quotes at the very end of a line. The web server must also somehow escape or encode potential double quotes inside the header value (Nginx hex-encodes double quotes to \x22).
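The requirement above (User-Agent in double quotes at the very end of the line, embedded quotes escaped) makes extraction straightforward. The following is only a minimal sketch of that idea, not necessarily how useragent implements it; the function name extractUserAgent is made up for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// extractUserAgent (hypothetical helper) returns the content of the last
// double-quoted field of an access log line. It relies on the web server
// escaping embedded double quotes (Nginx writes them as \x22), so the last
// two '"' characters on the line delimit the header value.
func extractUserAgent(line string) (string, bool) {
	line = strings.TrimRight(line, " \t\r\n")
	if !strings.HasSuffix(line, `"`) {
		return "", false
	}
	body := line[:len(line)-1] // drop the closing quote
	start := strings.LastIndexByte(body, '"')
	if start < 0 {
		return "", false
	}
	return body[start+1:], true
}

func main() {
	line := `2022-01-12T22:40:07+01:00 "GET /twtxt.txt HTTP/1.1" 304 "jenny/latest (+https://example.org/peter.txt; @peter)"`
	if ua, ok := extractUserAgent(line); ok {
		fmt.Println(ua)
	}
}
```

Because only the end of the line is inspected, the fields in front of the User-Agent can be in any order or format.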
Supported User-Agent Formats
- Official single user client format, e.g.
twtxt/1.2.3 (+https://example.com/twtxt.txt; @somebody)
Both semicolon and comma are recognized separators.
- Extended multi user client format, e.g.
twtxt/0.1.0@abcdefg (~https://example.com/whoFollows?token=randomtoken123; contact=https://example.com/support)
- Old twtd multi user client format with 2-5 followers, e.g.
twtxt/0.1.0@69ac73b (Pod: example.com Followers: hugo kate Support: https://example.com/support)
- Old twtd multi user client format with 6 or more followers, e.g.
twtxt/0.1.0@37fd365 (Pod: example.com Followers: eugen hugo kate lieschen richard and 3 more... https://example.com/whoFollows?uri=https://example.org/twtxt.txt&nick=steffi&token=OzcdPbe6Z Support: https://example.com/support)
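As an illustration of the official single user client format from the list above, here is a rough sketch of how such a header could be taken apart with a regular expression. The pattern, the follower type and parseSingleUser are assumptions made for this example and are not taken from useragent's source.

```go
package main

import (
	"fmt"
	"regexp"
)

// singleUserRe is an illustrative pattern for
// "client/version (+feedURL; @nick)". Both ';' and ',' are accepted as
// separators between the feed URL and the nick.
var singleUserRe = regexp.MustCompile(`^([^/\s]+)/(\S+) \(\+([^;,\s]+)[;,]\s*@([^)\s]+)\)$`)

// follower holds the parts of a single user User-Agent (hypothetical type).
type follower struct {
	Client, Version, URL, Nick string
}

// parseSingleUser reports whether ua matches the single user format and, if
// so, returns its components.
func parseSingleUser(ua string) (follower, bool) {
	m := singleUserRe.FindStringSubmatch(ua)
	if m == nil {
		return follower{}, false
	}
	return follower{Client: m[1], Version: m[2], URL: m[3], Nick: m[4]}, true
}

func main() {
	f, ok := parseSingleUser("twtxt/1.2.3 (+https://example.com/twtxt.txt; @somebody)")
	fmt.Println(ok, f.Nick, f.URL)
}
```

The multi user formats start their parenthesized part with `~` or `Pod:` instead of `+`, so they do not match this pattern and would need their own handling.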
In case of both old twtd formats, all followers directly found in the User-Agent header are extracted and their twtxt.txt URLs constructed from the support URL or Who Follows Resource URL.
The Who Follows Resources are not queried by default. The latest URLs (assuming newest log records are always appended to the access log) of each encountered hostname are printed. Operators can either manually query their followers or use the -r flag to resolve them automatically.
Example Usage
$ useragent < /somewhere/twtxt.log
Twtxt UAs: 16841 Non-Twtxt UAs: 1709
343 @kate → http://example.com/user/kate/twtxt.txt
4309 @eugen → http://example.com/user/eugen/twtxt.txt
34 @lieschen → http://example.com/user/lieschen/twtxt.txt
34 @hugo → http://example.com/user/hugo/twtxt.txt
9902 @richard → http://example.com/user/richard/twtxt.txt
983 @peter → https://example.org/peter.txt
32900 @somebody → https://example.com/twtxt.txt
34 http://example.com/whoFollows?followers=8&token=gLvOWbFYT
Analyzing Unknown User-Agents
The -u or --show-unknown flag prints all the unknown User-Agents in raw form with their occurrences to stdout, too. This way the program can be further tuned and new twtxt formats built into it.
The -c or --classify-unknown flag classifies and prints all the unknown User-Agents with their occurrences to stdout. It uses a very simple builtin mapping table for Twtxt clients, bots, libraries and browsers. Please note: this mapping is far from complete and may even produce incorrect classifications. This flag implies -u/--show-unknown.
The -g or --group-unknown flag groups and prints all the classified unknown User-Agents with their occurrences to stdout. This is similar to -c/--classify-unknown, except that the output is even more condensed and only the groups of the classified agents are shown. Groups are Twtxt client, Bots, Libraries, Browsers, - (if empty) and (???) (for unmapped agents). This flag implies -uc, i.e. --show-unknown --classify-unknown.
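To give an idea of what a simple mapping table for grouping could look like, here is a small sketch. It is not useragent's actual table: the rules, their order and the classify function are invented for illustration, and only the group names come from the description above.

```go
package main

import (
	"fmt"
	"strings"
)

// rule maps a lowercase substring of a User-Agent to a group.
// All entries are hypothetical examples, not useragent's builtin mapping.
type rule struct{ needle, group string }

// rules are checked in order, so "bot" wins over "mozilla/" for crawlers
// that also announce themselves as Mozilla-compatible.
var rules = []rule{
	{"bot", "Bots"},
	{"curl/", "Libraries"},
	{"go-http-client/", "Libraries"},
	{"mozilla/", "Browsers"},
}

// classify returns the group for an unknown User-Agent: "-" if the header
// is empty (logged as "-"), "(???)" if no rule matches.
func classify(ua string) string {
	if ua == "" || ua == "-" {
		return "-"
	}
	lower := strings.ToLower(ua)
	for _, r := range rules {
		if strings.Contains(lower, r.needle) {
			return r.group
		}
	}
	return "(???)"
}

func main() {
	for _, ua := range []string{"curl/7.81.0", "-", "SomethingElse/1.0"} {
		fmt.Printf("%-20s %s\n", ua, classify(ua))
	}
}
```

Substring matching like this is crude, which is in line with the note above that the classification may be incomplete or wrong.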
Resolving and Fetching Who Follows Resources
The -r or --resolve-who-follows flag attempts to resolve nicks and feed URLs from all the encountered Who Follows Resources. Instead of the Who Follows Resource URLs, the nicks and feed URLs will be printed together with the number of hits.
In order to change the default timeout of 1s when resolving the Who Follows Resources, -t or --timeout can be used to raise or lower it. This flag accepts everything that Go's time.ParseDuration(…) function recognizes:
A duration string is a possibly signed sequence of decimal numbers, each with optional fraction and a unit suffix, such as 300ms, -1.5h or 2h45m. Valid time units are ns, us (or µs), ms, s, m, h.
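The following short sketch shows how time.ParseDuration treats a few candidate timeout values. The wrapper parseTimeout is a made-up helper for this example; how useragent wires the value into its flag handling is not shown here.

```go
package main

import (
	"fmt"
	"time"
)

// parseTimeout (hypothetical helper) converts a -t/--timeout style value
// into a time.Duration using the standard library parser.
func parseTimeout(value string) (time.Duration, error) {
	d, err := time.ParseDuration(value)
	if err != nil {
		return 0, fmt.Errorf("invalid timeout %q: %w", value, err)
	}
	return d, nil
}

func main() {
	// "10" has no unit suffix and is therefore rejected by ParseDuration.
	for _, value := range []string{"300ms", "1s", "2h45m", "10"} {
		d, err := parseTimeout(value)
		if err != nil {
			fmt.Println(err)
			continue
		}
		fmt.Printf("%s parses to %v\n", value, d)
	}
}
```

Note that a bare number such as 10 is not a valid duration; a unit like 10s is required.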
Please note that yarnd's multi user User-Agents contain a token that is invalidated once used. So successive calls to the same Who Follows Resource will fail until a new feed fetch with a new token gets logged. useragent will neither keep track of used URLs nor cache the resolved nicks and feed URLs for future reuse.
Supported Who Follows Resource Formats
- Of course, the official Who Follows Resource JSON format is supported, e.g.
{"somebody": "https://example.com/user/somebody/twtxt.txt"}
- Yarnd replied with a buggy JSON array of objects format for quite some time, e.g.
[{"Nick": "somebody", "URL": "https://example.com/user/somebody/twtxt.txt", "LastFetchedAt": "2022-01-02T23:04:22.208290302Z"}]
- Yarnd even sent empty URL but filled URI fields, e.g.
[{"URI": "https://example.com/user/somebody/twtxt.txt", "URL": "", "Nick": "somebody", "LastSeenAt": "2022-01-03T08:54:36.638147948Z"}]
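The three response shapes listed above can be handled by one decoder that first tries the official object format and then falls back to the array variants. This is a sketch under that assumption only; decodeWhoFollows and legacyFollower are hypothetical names and the real implementation may differ.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// legacyFollower mirrors one element of the array formats shown above.
// Unrelated fields such as LastFetchedAt or LastSeenAt are simply ignored.
type legacyFollower struct {
	Nick string
	URL  string
	URI  string
}

// decodeWhoFollows (hypothetical helper) returns a nick to feed URL mapping
// for any of the three response shapes.
func decodeWhoFollows(data []byte) (map[string]string, error) {
	// Official format: a JSON object mapping nick to feed URL.
	var official map[string]string
	if err := json.Unmarshal(data, &official); err == nil {
		return official, nil
	}
	// Fallback: a JSON array of objects with Nick and URL or URI fields.
	var legacy []legacyFollower
	if err := json.Unmarshal(data, &legacy); err != nil {
		return nil, fmt.Errorf("unrecognized Who Follows Resource: %w", err)
	}
	followers := make(map[string]string, len(legacy))
	for _, f := range legacy {
		feed := f.URL
		if feed == "" {
			feed = f.URI // some responses only filled URI
		}
		followers[f.Nick] = feed
	}
	return followers, nil
}

func main() {
	body := []byte(`[{"URI": "https://example.com/user/somebody/twtxt.txt", "URL": "", "Nick": "somebody"}]`)
	followers, err := decodeWhoFollows(body)
	if err != nil {
		fmt.Println(err)
		return
	}
	for nick, feed := range followers {
		fmt.Printf("@%s → %s\n", nick, feed)
	}
}
```

Normalizing everything into a single map keeps the rest of the program independent of which variant a server happened to send.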
License
useragent is licensed under the terms of the AGPLv3 and the code was originally hard-forked and borrowed (with permission) from @lyse@lyse.isobeef.org's tt Twtxt client.