Following the same feed via different URLs #151

Open
opened 4 days ago by lyse · 13 comments
lyse commented 4 days ago
Owner

As discovered in the Saturday call last weekend, there is a corner case in yarnd that should be addressed.

Let there be a feed that is available via different URLs. Suppose the feed is served over four different protocols, HTTP, HTTPS, Gopher and Gemini (or just any subset of them). However, the issue is not limited to just different protocols, even over the same protocol there could be copies of the same feed or redirects in place. For brevity, let's limit the general problem to different protocols for further discussion.

If user A subscribes to the same feed via HTTP, user B via HTTPS, user C via Gopher and user D via Gemini, yarnd will probably (I did not check the code) fetch the same logical feed four times in the same download cycle.

To make things a bit worse, since Gopher and Gemini do not offer any caching header concepts, the entire feed is always transmitted, even if there are no changes in the feed, wasting precious bandwidth. In contrast, HTTP(S) replies can be much shorter 304 Not Modified responses, if properly configured.
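
To illustrate the HTTP(S) side, a quick sketch of what a conditional fetch might look like (hypothetical Go, not the actual yarnd code; the cached ETag handling is an assumption):

```go
package main

import "net/http"

// fetchIfChanged issues a conditional GET: if we cached an ETag from a
// previous fetch, send it as If-None-Match so a properly configured HTTP(S)
// server can answer with a short 304 Not Modified instead of the full feed.
// Gopher and Gemini have no equivalent, so the whole feed comes back anyway.
func fetchIfChanged(url, cachedETag string) (*http.Response, error) {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	if cachedETag != "" {
		req.Header.Set("If-None-Match", cachedETag)
	}
	return http.DefaultClient.Do(req)
}
```

The caller would then check for `http.StatusNotModified` and keep its cached copy of the feed.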

By looking at the different `url` metadata fields in the feed, yarnd could detect that the four feeds are actually the same, provided that the feed correctly ships this information. With that information it could deduplicate the fetch list of URLs and only query the feed once.
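
A minimal sketch of such a deduplication (hypothetical Go; the `aliases` map with each feed's declared `url` values is assumed to have been collected from earlier fetches):

```go
package main

// dedupeFetchList collapses subscriptions that point at the same logical
// feed into a single fetch. aliases maps a feed URL to all URLs that feed
// declares about itself via its url metadata fields.
func dedupeFetchList(subscriptions []string, aliases map[string][]string) []string {
	covered := make(map[string]bool)
	var fetchList []string
	for _, sub := range subscriptions {
		if covered[sub] {
			continue // already handled by an alias we are going to fetch
		}
		fetchList = append(fetchList, sub)
		covered[sub] = true
		for _, alias := range aliases[sub] {
			covered[alias] = true
		}
	}
	return fetchList
}
```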

The question is, should the subscription list of all the users who follow that logical feed be rewritten to a canonical URL? If so, which would it be? Always the HTTPS URL, followed by HTTP if there is no HTTPS URL available, followed by any Gopher and Gemini? Or the first `url`, which is also used to calculate the twt hashes?

When downloading, if the feed is not reachable over one protocol, yarnd could retry with another protocol and get the feed updates that way.
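
Picking which URL to try first and falling back on failure might look roughly like this (hypothetical Go sketch; the preference order HTTPS > HTTP > Gemini > Gopher is just one possible choice):

```go
package main

import (
	"sort"
	"strings"
)

// orderByProtocol sorts a feed's known URLs by protocol preference so the
// fetcher can try them in order and fall back to the next one on failure.
func orderByProtocol(urls []string) []string {
	rank := func(u string) int {
		switch {
		case strings.HasPrefix(u, "https://"):
			return 0
		case strings.HasPrefix(u, "http://"):
			return 1
		case strings.HasPrefix(u, "gemini://"):
			return 2
		case strings.HasPrefix(u, "gopher://"):
			return 3
		default:
			return 4
		}
	}
	ordered := append([]string(nil), urls...)
	sort.SliceStable(ordered, func(i, j int) bool {
		return rank(ordered[i]) < rank(ordered[j])
	})
	return ordered
}
```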

Something like [feed URL normalization](https://git.isobeef.org/lyse/twtxt-feed-url-normalization-database) might come in handy here. Ignore the database part here. Or put differently, such a database could benefit from a solution to this issue, too.

Owner

(haven't read the full OP content yet...), but:

> If user A subscribes to the same feed via HTTP, user B via HTTPS, user C via Gopher and user D via Gemini, yarnd will probably (I did not check the code) fetch the same logical feed four times in the same download cycle.

I _think_ you are right. And I _think_ this is because at that point in time we don't really have enough information to go on as to whether or not a given URI _might_ happen to be the same "feed" as another URI.

I _think_ this is a general problem entirely and not just a problem with `yarnd`, although it's more "exacerbated" (because multi-user).

Owner

> To make things a bit worse, since Gopher and Gemini do not offer any caching header concepts, the entire feed is always transmitted, even if there are no changes in the feed, wasting precious bandwidth. In contrast, HTTP(S) replies can be much shorter 304 Not Modified responses, if properly configured.

Yes this is true. Not really sure we can do much here? 🤔

Owner

> The question is, should the subscription list of all the users who follow that logical feed be rewritten to a canonical URL? If so, which would it be? Always the HTTPS URL, followed by HTTP if there is no HTTPS URL available, followed by any Gopher and Gemini? Or the first `url`, which is also used to calculate the twt hashes?

It's actually funny you raise this Issue, as I _actually_ tried to explore this myself just a few weeks ago with the 3 different "kevin" feed(s), one of which has wrong/incorrect "# url" metadata 🤦‍♂️ -- Sadly I abandoned the work as I really wasn't sure myself.

I think we basically have a "feed identity" problem, right? How should we address this? 🤔

prologic added the help wanted and question labels 4 days ago
Owner

@lyse Ooops I just noticed the repo you posted this Issue on, oh well it doesn't matter really, but Ooops 😅

Poster
Owner

Whoops, sorry for the wrong repository!

> And I think this is because at that point in time we don't really have enough information to go on as to whether or not a given URI might happen to be the same "feed" as another URI.

Provided that the `url` metadata is properly maintained, yarnd could check against its database/cache/whatever whether that new feed URL, which is about to be added by somebody, is already known and then ignore it. I suspect (didn't check) that adding a new feed already does something similar, just against the list of actually subscribed feed URLs. This check could be extended to include all the "feed aliases".

> I think this is a general problem entirely and not just a problem with `yarnd`, although it's more "exacerbated" (because multi-user).

Yeah, this could theoretically happen in every client. However, chances are really slim that a user of a single-user client will actually end up subscribing to the same feed via different URLs. Basically zero in my opinion.

> > To make things a bit worse, since Gopher and Gemini do not offer any caching header concepts, the entire feed is always transmitted, even if there are no changes in the feed, wasting precious bandwidth. In contrast, HTTP(S) replies can be much shorter 304 Not Modified responses, if properly configured.
>
> Yes this is true. Not really sure we can do much here? 🤔

Exactly, there is nothing that can be done in yarnd, except for preferring HTTP(S) if a feed is known to offer different protocols. But that is tied to the question of how to define the "feed identity", as you called it.

# Security Considerations

Now I just noticed that any logic on that matter would open up another can of worms. Malicious parties could publish a feed with all sorts of feed aliases (multiple `url` metadata) that are in fact completely different feeds. Then, yarnd would suddenly not fetch these feeds anymore, cutting off any traffic to several other, legitimate feeds.

No idea at the moment how to solve that. Maybe limit the feed aliases to the same hostname, but that would not work in all cases. Legitimate feeds might be served with different hostnames (e.g. https://www.uninformativ.de/twtxt.txt (with www subdomain) and gopher://uninformativ.de/0/twtxt.txt (without www subdomain)). They would not be recognized as the same feed. And users of multi-user hosters (example.com/~eugen/twtxt.txt and example.com/~kate/twtxt.txt) could still stop other feeds from being downloaded.

Poster
Owner

Maybe fetch all declared feed aliases, and if they declare the same feed URLs in turn, they could be considered legitimate. If a feed alias does not appear in the response, it would not be part of the alias list. Example:

Feed `https://example.com/twtxt.txt`:

```
url = https://example.com/twtxt.txt
url = https://example.org/legitimate-alias.txt
url = https://example.net/illegitimate-alias.txt
```

Feed `https://example.org/legitimate-alias.txt`:

```
url = https://example.com/twtxt.txt
url = https://example.org/legitimate-alias.txt
url = https://example.net/illegitimate-alias.txt
```

Feed `https://example.net/illegitimate-alias.txt`:

```
url = https://example.net/illegitimate-alias.txt
```

Both `https://example.com/twtxt.txt` and `https://example.org/legitimate-alias.txt` declare each other as feed aliases, hence they could be considered the same feed. Even though both also claim that `https://example.net/illegitimate-alias.txt` would be a third alias, it would in fact not be part of that feed alias list, because it does not declare the other two as aliases of itself. Thus, it would be a separate logical feed.

If this idea is going to be pursued, more thought needs to be put into transitive feed alias declarations: whether to support them (and what rules to apply to them) or not.
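
A rough sketch of the mutual-declaration check from the example above (hypothetical Go; `declared` would be filled from the `url` metadata of each fetched feed, and transitive declarations are deliberately ignored here):

```go
package main

// aliasGroup returns the URLs that the seed feed declares and that declare
// the seed back. A URL that does not list the seed in its own url metadata
// (like the illegitimate alias above) is treated as a separate logical feed.
// declared maps a fetched feed URL to the set of URLs it declares for itself.
func aliasGroup(seed string, declared map[string]map[string]bool) []string {
	group := []string{seed}
	for candidate := range declared[seed] {
		if candidate == seed {
			continue
		}
		if declared[candidate][seed] {
			group = append(group, candidate)
		}
	}
	return group
}
```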

Owner

> Provided that the `url` metadata is properly maintained, yarnd could check against its database/cache/whatever whether that new feed URL, which is about to be added by somebody, is already known and then ignore it. I suspect (didn't check) that adding a new feed already does something similar, just against the list of actually subscribed feed URLs. This check could be extended to include all the "feed aliases".

I _think_ this is reasonable. We'd just have to maintain a reverse mapping of `URI -> Twter` object in memory. If we find one, then we know it belongs to that `Twter` object that already exists and _may_ have multiple `URI`(s).
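
Something along these lines, maybe (sketch only; the real `Twter` type in yarn looks different, and this in-memory index is an assumption):

```go
package main

// Twter stands in for yarn's feed/author object; only the fields needed
// for this example are shown.
type Twter struct {
	Nick string
	URIs []string
}

// resolveTwter consults the reverse index before creating a new Twter, so
// a URI that is already known as an alias maps onto the existing object.
func resolveTwter(uri string, byURI map[string]*Twter) *Twter {
	if t, ok := byURI[uri]; ok {
		return t // alias of a feed we already track
	}
	t := &Twter{URIs: []string{uri}}
	byURI[uri] = t
	return t
}
```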

Owner

> Yeah, this could theoretically happen in every client. However, chances are really slim that a user of a single-user client will actually end up subscribing to the same feed via different URLs. Basically zero in my opinion.

True, or a single-user pod 😅 (also a single-user client)

Owner

> Exactly, there is nothing that can be done in yarnd, except for preferring HTTP(S) if a feed is known to offer different protocols. But that is tied to the question of how to define the "feed identity", as you called it.

Actually I like the idea of "Preferred Protocols" here 👌

Owner

@lyse The big open "security question" / "security hole" is arguably one of the reasons I've never solved this (to date). This is a hard™ problem.

Owner

@lyse Yes we _could_ validate feed URI(s). That would solve the security problem, right? It means it would take several fetch cycles to properly cache feed URI aliases to the same `Twter` object, I think. But that's probably fine.

Owner

Although I'm still not sure how this will work... Need to write some pseudocode and do a sanity/security check on the logic...

Poster
Owner

Yes, doing this in a secure manner increases complexity by quite a lot. :-(

Reference: yarnsocial/app#151