As discovered in the Saturday call last weekend, there is a corner case in yarnd that should be addressed.
Let there be a feed that is available via different URLs. Suppose the feed is served over four different protocols: HTTP, HTTPS, Gopher and Gemini (or any subset of them). However, the issue is not limited to different protocols; even over the same protocol there could be copies of the same feed or redirects in place. For brevity, let's limit the general problem to different protocols for further discussion.
If user A subscribes to the same feed via HTTP, user B via HTTPS, user C via Gopher and user D via Gemini, yarnd will probably (I did not check the code) fetch the same logical feed four times in the same download cycle.
To make things a bit worse, since Gopher and Gemini do not offer any caching header concepts, the entire feed is always transmitted, even if there are no changes in the feed, wasting precious bandwidth. In contrast, HTTP(S) replies can be much shorter 304 Not Modified responses, if properly configured.
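This is where HTTP(S)'s conditional requests pay off. A small Go sketch of an `If-Modified-Since` fetch against a local test server (the helper names and the `demo` setup are made up for illustration; this is not yarnd code):

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
	"time"
)

// fetchIfModified does a conditional GET: when the feed has not changed since
// lastFetched, a well-configured HTTP(S) server answers 304 Not Modified and
// sends no body. Gopher and Gemini have no equivalent mechanism.
func fetchIfModified(url string, lastFetched time.Time) (bool, error) {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return false, err
	}
	req.Header.Set("If-Modified-Since", lastFetched.UTC().Format(http.TimeFormat))
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()
	return resp.StatusCode != http.StatusNotModified, nil
}

// demo serves a static "feed" last modified at modTime and probes it once
// with an older and once with a newer If-Modified-Since timestamp.
func demo() (staleFetchChanged, freshFetchChanged bool) {
	modTime := time.Date(2023, 1, 1, 0, 0, 0, 0, time.UTC)
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// ServeContent implements the If-Modified-Since handling for us.
		http.ServeContent(w, r, "twtxt.txt", modTime, strings.NewReader("2023-01-01T00:00:00Z\thello"))
	}))
	defer srv.Close()

	staleFetchChanged, _ = fetchIfModified(srv.URL, modTime.Add(-time.Hour)) // our copy predates the change
	freshFetchChanged, _ = fetchIfModified(srv.URL, modTime.Add(time.Hour))  // our copy is newer: expect 304
	return staleFetchChanged, freshFetchChanged
}

func main() {
	stale, fresh := demo()
	fmt.Println(stale, fresh) // true false
}
```

The key point is the second probe: no body is transferred at all, only the 304 status line and headers.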
By looking at the different `url` metadata fields in the feed, yarnd could detect that the four feeds are actually the same, provided that the feed correctly ships this information. With that information it could deduplicate the fetch list of URLs and only query the feed once.
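A sketch of what such deduplication could look like, assuming the fetcher already knows (e.g. from a previous fetch or its cache) which `url` values each subscribed feed declares. `dedupeFetchList` and the `aliases` map are hypothetical names, not yarnd API:

```go
package main

import "fmt"

// dedupeFetchList collapses subscribed URLs that belong to the same logical
// feed into a single fetch. aliases maps each URL to the list of `url`
// metadata values the feed behind it declares.
func dedupeFetchList(subscribed []string, aliases map[string][]string) []string {
	covered := make(map[string]bool) // URLs already served by a planned fetch
	var fetch []string
	for _, u := range subscribed {
		if covered[u] {
			continue // an alias of this URL is already on the fetch list
		}
		fetch = append(fetch, u)
		covered[u] = true
		for _, a := range aliases[u] {
			covered[a] = true // fetching u also covers its declared aliases
		}
	}
	return fetch
}

func main() {
	// The same logical feed, subscribed via four different protocols.
	all := []string{
		"http://example.com/twtxt.txt",
		"https://example.com/twtxt.txt",
		"gopher://example.com/0/twtxt.txt",
		"gemini://example.com/twtxt.txt",
	}
	aliases := make(map[string][]string)
	for _, u := range all {
		aliases[u] = all // each copy declares all four `url` values
	}
	fmt.Println(dedupeFetchList(all, aliases)) // one fetch instead of four
}
```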
The question is, should the subscription lists of all the users who follow that logical feed be rewritten to a canonical URL? If so, which one would it be? Always the HTTPS URL, followed by HTTP if there is no HTTPS URL available, followed by any Gopher or Gemini URL? Or the first `url`, which is also used to calculate the twt hashes?
When downloading, if the feed is not reachable over one protocol, yarnd could retry with another protocol and get the feed updates that way.
Something like [feed URL normalization](https://git.isobeef.org/lyse/twtxt-feed-url-normalization-database) might come in handy here. Ignore the database part here. Or, put differently, such a database could benefit from a solution to this issue, too.
(haven't read the full OP content yet...), but:
> If user A subscribes to the same feed via HTTP, user B via HTTPS, user C via Gopher and user D via Gemini, yarnd will probably (I did not check the code) fetch the same logical feed four times in the same download cycle.
I _think_ you are right. And I _think_ this is because at that point in time we don't really have enough information to go on as to whether or not a given URI _might_ happen to be the same "feed" as another URI.
I _think_ this is a general problem entirely and not just a problem with `yarnd`, although it's more exacerbated there (because multi-user).
> To make things a bit worse, since Gopher and Gemini do not offer any caching header concepts, the entire feed is always transmitted, even if there are no changes in the feed, wasting precious bandwidth. In contrast, HTTP(S) replies can be much shorter 304 Not Modified responses, if properly configured.
Yes this is true. Not really sure we can do much here? 🤔
> The question is, should the subscription lists of all the users who follow that logical feed be rewritten to a canonical URL? If so, which one would it be? Always the HTTPS URL, followed by HTTP if there is no HTTPS URL available, followed by any Gopher or Gemini URL? Or the first url, which is also used to calculate the twt hashes?
It's actually funny you raise this Issue, as I _actually_ tried to explore this myself just a few weeks ago with all three different "kevin" feed(s), one of which has wrong/incorrect "# url" metadata 🤦♂️ -- Sadly I abandoned the work as I really wasn't sure myself.
I think we basically have a "feed identity" problem, right? How should we address this? 🤔
> @lyse Ooops I just noticed the repo you posted this Issue on, oh well it doesn't matter really, but Ooops 😅

Whoops, sorry for the wrong repository!
> And I think this is because at that point in time we don't really have enough information to go on as to whether or not a given URI might happen to be the same "feed" as another URI.
Provided that the `url` metadata is properly maintained, yarnd could check against its database/cache/whatever whether the new feed URL that is about to be added by somebody is already known, and then ignore it. I suspect (didn't check) that adding a new feed already does something similar, just against the list of actually subscribed feed URLs. This check could be extended to include all the "feed aliases".
> I think this is a general problem entirely and not just a problem with yarnd, although it's more exacerbated there (because multi-user).
Yeah, this could theoretically happen in every client. However, chances are really slim that a user of a single-user client will actually end up subscribing to the same feed via different URLs. Basically zero in my opinion.
> > To make things a bit worse, since Gopher and Gemini do not offer any caching header concepts, the entire feed is always transmitted, even if there are no changes in the feed, wasting precious bandwidth. In contrast, HTTP(S) replies can be much shorter 304 Not Modified responses, if properly configured.
>
> Yes this is true. Not really sure we can do much here? 🤔
Exactly, there is nothing that can be done in yarnd. Except for preferring HTTP(S) if a feed is known to offer different protocols. But that is tied to the question of how to define the "feed identity", as you called it.
# Security Considerations
Now I just noticed that any logic on that matter would open up another can of worms. Malicious parties could publish a feed with all sorts of feed aliases (multiple `url` metadata) that are in fact completely different feeds. Then, yarnd would suddenly not fetch these feeds anymore, cutting off any traffic to several other, legitimate feeds.
No idea at the moment how to solve that. Maybe limit the feed aliases to the same hostname, but that would not work in all cases. Legitimate feeds might be served under different hostnames (e.g. https://**www.uninformativ.de**/twtxt.txt (with www subdomain) and gopher://**uninformativ.de**/0/twtxt.txt (without www subdomain)). They would not be recognized as the same feed. And users of multi-user hosts (example.com/~eugen/twtxt.txt and example.com/~kate/twtxt.txt) could still stop other feeds from being downloaded.
Maybe all feed aliases could be fetched, and if they declare the same feed URLs, they could be considered legitimate. If a feed alias does not confirm the others in its own response, it would not be part of the alias list. Example:
Feed `https://example.com/twtxt.txt`:
```
url = https://example.com/twtxt.txt
url = https://example.org/legitimate-alias.txt
url = https://example.net/illegitimate-alias.txt
```
Feed `https://example.org/legitimate-alias.txt`:
```
url = https://example.com/twtxt.txt
url = https://example.org/legitimate-alias.txt
url = https://example.net/illegitimate-alias.txt
```
Feed `https://example.net/illegitimate-alias.txt`:
```
url = https://example.net/illegitimate-alias.txt
```
Both `https://example.com/twtxt.txt` and `https://example.org/legitimate-alias.txt` declare each other as feed aliases, hence they could be considered the same feed. Even though both also claim that `https://example.net/illegitimate-alias.txt` would be a third alias, it would in fact not be part of that feed alias list, because it does not declare the other two as aliases of itself. Thus, it would be a separate logical feed.
If this idea is going to be pursued, more thought needs to be put into transitive feed alias declarations: whether to support them (and what rules to apply to them) or not.
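Ignoring transitivity for now, the mutual-declaration rule from the example above could be sketched like this (it assumes the `url` lists of all candidate aliases have already been fetched; `mutualAliases` is a hypothetical helper, not yarnd code):

```go
package main

import "fmt"

// contains reports whether list includes u.
func contains(list []string, u string) bool {
	for _, v := range list {
		if v == u {
			return true
		}
	}
	return false
}

// mutualAliases returns the subset of feed's declared aliases that declare
// the feed back. declared maps each feed URL to the `url` values it publishes.
func mutualAliases(feed string, declared map[string][]string) []string {
	var out []string
	for _, a := range declared[feed] {
		if a == feed || contains(declared[a], feed) {
			out = append(out, a)
		}
	}
	return out
}

func main() {
	full := []string{
		"https://example.com/twtxt.txt",
		"https://example.org/legitimate-alias.txt",
		"https://example.net/illegitimate-alias.txt",
	}
	declared := map[string][]string{
		full[0]: full,
		full[1]: full,
		// The illegitimate alias does not declare the other two back.
		full[2]: {full[2]},
	}
	fmt.Println(mutualAliases(full[0], declared))
	// only the first two URLs survive; example.net stays a separate logical feed
}
```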
> Provided that the url metadata is properly maintained, yarnd could check against its database/cache/whatever whether the new feed URL that is about to be added by somebody is already known, and then ignore it. I suspect (didn't check) that adding a new feed already does something similar, just against the list of actually subscribed feed URLs. This check could be extended to include all the "feed aliases".
I _think_ this is reasonable. We'd just have to maintain a reverse mapping of `URI -> Twter` object in memory. If we find one, then we know it belongs to a `Twter` object that already exists and _may_ have multiple `URI`(s).
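A rough sketch of such a reverse mapping, with a simplified stand-in for the `Twter` type (yarnd's real type and registration logic differ; `Registry` and its methods are invented for illustration):

```go
package main

import "fmt"

// Twter is a simplified stand-in for yarnd's feed-author object.
type Twter struct {
	Nick string
	URIs []string
}

// Registry keeps the reverse mapping URI -> *Twter, so adding a feed URL
// that is really an alias of a known feed resolves to the existing object.
type Registry struct {
	byURI map[string]*Twter
}

func NewRegistry() *Registry {
	return &Registry{byURI: make(map[string]*Twter)}
}

// Lookup returns the Twter already registered under any of the given URIs.
func (r *Registry) Lookup(uris ...string) *Twter {
	for _, u := range uris {
		if t, ok := r.byURI[u]; ok {
			return t
		}
	}
	return nil
}

// Add registers a feed under all of its alias URIs, reusing the existing
// Twter if any alias is already known.
func (r *Registry) Add(nick string, uris []string) *Twter {
	t := r.Lookup(uris...)
	if t == nil {
		t = &Twter{Nick: nick}
	}
	for _, u := range uris {
		if _, ok := r.byURI[u]; !ok {
			r.byURI[u] = t
			t.URIs = append(t.URIs, u)
		}
	}
	return t
}

func main() {
	reg := NewRegistry()
	a := reg.Add("kevin", []string{"https://example.com/twtxt.txt", "gopher://example.com/0/twtxt.txt"})
	// Somebody else subscribes via Gopher: it resolves to the same object.
	b := reg.Add("kevin", []string{"gopher://example.com/0/twtxt.txt"})
	fmt.Println(a == b, len(a.URIs)) // true 2
}
```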
> Yeah, this could theoretically happen in every client. However, chances are really slim that a user of a single-user client will actually end up subscribing to the same feed via different URLs. Basically zero in my opinion.
True, or a single-user pod 😅 (also a single-user client)
> Exactly, there is nothing that can be done in yarnd. Except from prefering HTTP(S) if a feed is known to offer different protocols. But that is tied to the question how to define the "feed identity", as you called it.
Actually I like the idea of "Preferred Protocols" here 👌
@lyse Yes we _could_ validate feed URI(s). That would solve the security problem, right? It means it would take several fetch cycles to properly cache feed URI aliases to the same `Twter` object, I think. But that's probably fine.
@lyse The big open "security question" / "security hole" is arguably one of the reasons I've never solved this (to date). This is a hard™ problem.
Although I'm still not sure how this will work... Need to write some pseudocode and do a sanity/security check on the logic...
Yes, doing this in a secure manner increases complexity by quite a lot. :-(