Tiberium 5 hours ago

Important to note that the author assumes that this is ByteDance, but the ASN belongs to their cloud solution BytePlus, which could be used by other companies.

https://x.com/sauceo_/status/1842866301066518875

https://www.byteplus.com/en

  • edouard-harris 4 hours ago

    The author does address this possibility in a reply:

    > it's very unlikely to be someone else because pricing is astronomical. you also have to "contact sales" to get access to anything outside of a free trial. no one would pay that much for a block of ips with terrible reputation

    https://x.com/uwukko/status/1842866807763308615

    • rfoo 3 hours ago

      > pricing is astronomical. you also have to "contact sales" to get access to anything outside of a free trial

      You don't have to contact sales if you are a Chinese-speaking customer. And pricing is fine. ByteDance has a different brand for their cloud services in China: https://www.volcengine.com/ [1]. But of course the underlying infrastructure is all the same.

      This is very likely done by a Chinese customer using ByteDance's cloud service.

      [1] Well, Alibaba Cloud did this too, and ByteDance is copying Alibaba 1:1 (who in turn is copying AWS) so I'm not surprised. But at least Alibaba named their international brand "Alibaba Cloud" and their CN one "AliCloud", similar enough.

    • Tiberium 4 hours ago

      Yes, but this is also pure speculation, since the product clearly exists, has customers, and even has a free trial.

  • teractiveodular 4 hours ago

    This. Bytedance's official spider has a clear User-Agent tagged Bytespider, but OP didn't mention what they're seeing.

    • jsheard 4 hours ago

      This isn't spider traffic though; the traffic pattern indicates that it's a special-purpose bot designed to hit Cobalt's internal API in particular. A generic spider probably wouldn't even be able to find the API endpoints that are only referenced by JavaScript, never mind consistently hit the API with a valid video URL from a residential proxy and then switch to a different IP address to download the result every time.

Thomashuet 5 hours ago

Short version: a service known for evading YouTube's bot protection is complaining that ByteDance is bypassing their own protections. I agree that it's not nice of ByteDance, but I find it hypocritical of Cobalt to call it evil.

  • lunarmony 5 hours ago

    > cobalt was created for public benefit, to protect people from ads and malware pushed by its alternatives

    can't say the same for bytedance, which is designed to exploit users with various ads

    • appendix-rock 4 hours ago

      I feel like you’re missing the point on purpose? Cobalt is asserting that it’s doing good based on the shadier behaviour of its competitors. But can you justify Cobalt in isolation any more than you can justify whoever was scraping it?

    • whywhywhywhy 3 hours ago

      It was created for donation money; let's not do mental gymnastics to justify one type of scraping and vilify another. Scraping is scraping and it's either all fair game or it's not all fair game.

  • h4x0rr 5 hours ago

    You can't compare that... cobalt doesn't DDOS YouTube

    • jsheard 5 hours ago

      Cobalt is also completely free, without ads or any other monetization besides donations, it's purely meant to help normal people download videos for normal people purposes. It's not like they're a for-profit data harvesting outfit complaining about getting abused by another for-profit data harvesting outfit.

      • Thomashuet 5 hours ago

        You're just saying that Cobalt is small and non-profit so they must be good and YouTube and ByteDance are big and rich so they must be evil. But if you only look at what they are actually doing here, it's very similar: bypassing protections to use a service in a way that the service provider doesn't like.

        • phoronixrly 5 hours ago

          Bytedance and youtube are evil, but not because they are big and rich. Cobalt is good, but not because they are small and a non-profit.

        • loloquwowndueo 4 hours ago

          If bytedance are so big and rich why don’t they implement their own scraping solution instead of abusing a small service like cobalt.

          • sangnoir 11 minutes ago

            ...Because someone scraping from a Bytedance IP range is not necessarily Bytedance, just like requests from an AWS IP do not imply Amazon authored the spider

        • snvy 4 hours ago

          Cobalt is bypassing protections to allow legitimate Youtube users to download single videos without causing harm and with no monetary incentives. Bytedance is mass downloading thousands of videos, all for monetary incentives while heavily breaking the TOS and potentially ignoring copyright laws. Similar, but one is doing way more harm than the other.

          • whywhywhywhy 3 hours ago

            > and with no monetary incentives

            Donations are a monetary incentive

            > while heavily breaking the TOS and potentially ignoring copyright laws

            Cobalt also breaks the TOS and ignores copyright laws. Personally I don't think that matters, but having a double standard, where it's OK when one company does it and then, when one you don't like does it, you try to use copyright laws and TOS as a weapon, just makes me think it really isn't about TOS or copyright, is it?

            It also just gives YouTube ammunition to impose stricter protection against smaller violators like cobalt, or people running yt-dlp themselves.

    • criddell 5 hours ago

      Cobalt didn’t say the DDOS was evil, they said:

      “bytedance's scraper was specifically built to go around cloudflare & other web security solutions, which is just genuinely evil”

      So I would say it’s a fair comparison.

      • dewey 5 hours ago

        > built to go around cloudflare

        Then they either didn't set up CF correctly or they just use the mode in most headless browsers that bypasses default CF protection when CF is not in attack mode.

  • afavour 3 hours ago

    I don't see the hypocrisy here. Cobalt is a small, free service that results in Google (or so the argument goes) making less profit. ByteDance are a giant money printing machine using that free service for their own ends. They have more than enough resources to not abuse a free one.

conradfr 5 hours ago

Some time ago I noticed the ByteDance spider very aggressively scraping my modest side project and, more importantly, modest server.

I wrote to them to please stop (I think the address was in the user agent or something), they replied sorry and actually stopped.

Not sure why all these crawlers can't pace themselves.
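
For what it's worth, pacing is not hard to implement. Below is a minimal sketch of a per-host delay tracker, assuming a hypothetical `HostPacer` class and an arbitrary default interval; it is not taken from any real crawler, and the clock is passed in so the logic is easy to check.

```python
from urllib.parse import urlsplit


class HostPacer:
    """Tracks the earliest time each host may be requested again.

    Hypothetical helper for illustration; the interval is an arbitrary choice.
    """

    def __init__(self, min_interval_seconds: float = 5.0):
        self.min_interval = min_interval_seconds
        self._next_allowed = {}  # host -> earliest timestamp for the next request

    def wait_time(self, url: str, now: float) -> float:
        """Reserve the next slot for this URL's host and return the wait in seconds."""
        host = urlsplit(url).netloc
        # Never schedule in the past, and never before the host's reserved slot.
        ready_at = max(now, self._next_allowed.get(host, now))
        self._next_allowed[host] = ready_at + self.min_interval
        return ready_at - now
```

A crawler would sleep for the returned number of seconds before each fetch, so requests to one small server get spread out while other hosts are unaffected.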

  • throwaway98797 3 hours ago

    devs are promoted on how fast they get done

    faster, bigger, MOAR

    sometimes it’s hard to have nice things

xbmcuser 5 hours ago

I think Chinese ISPs can't store some data as they might get in trouble with Chinese censors, so they don't cache it. And then if it gets slightly viral you see huge traffic from one IP that might be a VPN. On the Reddit torrent channel you get similar questions when, ahem, a Linux ISO is downloaded thousands of times from the same IP.

  • lithiumii 4 hours ago

    That could be a completely different problem. In China many people run PCDN (p2p CDN) for profit. The ISPs detect (and ban) such PCDN nodes by checking your uploaded / downloaded ratio. To balance this ratio and thus avoid being detected, these people download popular torrents again and again without uploading at all.

HeralFacker 4 hours ago

Blackhole Bytedance's ASNs. Cobalt is an end-user tool, so there's not much legitimacy to a cloud service accessing it.
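
A minimal sketch of what that could look like at the application level, assuming the only prefix to block is the one range quoted in the thread for AS150436 (207.166.160.0/21). A real blocklist would need every prefix the ASN announces, which is not enumerated here, and `is_blocked` is a hypothetical helper name.

```python
import ipaddress

# Single range quoted in the thread; assumed to be incomplete for the whole ASN.
BLOCKED_NETWORKS = [ipaddress.ip_network("207.166.160.0/21")]


def is_blocked(client_ip: str) -> bool:
    """Return True if client_ip falls inside any blocked network."""
    address = ipaddress.ip_address(client_ip)
    return any(address in network for network in BLOCKED_NETWORKS)
```

Note that this would only catch the download requests. Per the thread, the initial API request comes from a residential proxy, so it would pass a check like this.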

horsebridge 5 hours ago

Anybody running a site with data that is useful for AI will learn how horrible bytedance is.

3np 5 hours ago

Interesting timing. The last ~month or so we've seen a drastic shift in YouTube availability. Stricter enforcement of authentication tokens (including breaking some legacy clients) and IP blocking. Loads of Invidious instances either shut down or not able to serve videos anymore. yt-dlp not working at all over an increasing number of VPNs and proxies.

Maybe this is some ByteDance engineers getting really desperate and resorting to abusing every youtube proxy service they can because apparently they do have a residential proxy network which doesn't cut it anymore?

Unless it's just a cost-optimization measure (residential proxy traffic is relatively pricey).

  • A4ET8a8uTh0 5 hours ago

    Yeah, I noticed this as well. I think the window of what some might remember as old youtube is closing forever sooner than anticipated. As I may have suggested on this forum before, if you have anything in particular you want to archive, you would be wise to have a plan to do it sooner rather than later. Space is cheap enough and I assume most people won't want to archive the entire net ( I know data hoarders exist and god bless them, but I assume they will be ok ).

    • Wowfunhappy 4 hours ago

      Short of full-on using Widevine/eme for all videos (which I assume would lock out too many devices), how much more could Youtube do? As long as the data is being streamed to your computer, there will be a way to capture it, right?

      • 3np 4 hours ago

        I can very much imagine site-wide requiring Widevine/EME for anything better quality than 480 and crusty audio not that long into the future.

        That's already the case for some (anecdotally increasing) number of videos.

      • treyd 3 hours ago

        This would encourage a lot more people to want to break Widevine. :)

      • A4ET8a8uTh0 4 hours ago

        A qualified yes is probably in order ( and a more knowledgeable person can likely chime in if I misstate something ). It is and always has been a cat and mouse game not completely unlike with game or movie piracy. As you stated, if you can see it on your PC, there is likely a means to capture it.

        Still, notice how most of the low effort avenues are slowly being cut off one by one. I will use a non-YouTube example. Not that long ago, I was able to rip blurays using an off the shelf external bluray writer, but new firmware on currently sold drives removes that ability.

        Now, Google typically won't be ( and isn't ) everyone's hardware provider, but there are ways they could degrade 'non-sanctioned' experience in browser they can ( and do ) control.

        Granted, in Firefox ( and other non-google browsers ) it may not be as simple, but the future there is not as straightforward either given Mozilla's trajectory and financial dependence ( and moves ).

        In short, I agree with you but note that initially it was genuinely trivial to download youtube videos. This has changed over the years.

seanhunter 4 hours ago

A few big sites that I'm familiar with have, in the last six months, seen ByteDance become by far the most aggressive scraper in their logs.

FatalLogic 5 hours ago

>i can safely assume that bytedance was scraping youtube videos by abusing our private api

I'm not doubting the OP. But why is ByteDance doing this? What does that company get out of scraping YouTube?

ulrischa 5 hours ago

ByteDance is also massively scraping official governmental sites with strange URL patterns

sergiotapia 4 hours ago

why would they use cobalt instead of yt-dlp? is it to mask their originating IPs and such?

jsheard 5 hours ago

@uwukko's full thread for those who don't have a Twitter account:

earlier today i noticed very elevated traffic to cobalt api that looked a lot like ddos. it turned out to be bytedance!

we can't tell what videos they were downloading or where the original request comes from as it's built to go around all limiters, but there's still a pattern

first request: json post with content url & settings from a residential proxy

second request: tunnel with pseudo microsoft edge on windows user agent & youtube origin/referer, from byteplus ip

third request: same tunnel with aria2 user agent & no referer, also from byteplus ip

cobalt is a media downloader, mostly known for supporting youtube even at worst times. cobalt's tunnel is either a proxy stream or ffmpeg live render

considering all of this, i can safely assume that bytedance was scraping youtube videos by abusing our private api

with release of v10 we implemented cloudflare turnstile, but later disabled it due to access issues by a chunk of our users

enabling it back brought the server load to normal levels and stopped bytedance from choking our servers cuz they didn't account for this (yet)

before resorting to turnstile, i attempted using other cloudflare services, but none of them seemed to help much

my theory is that bytedance's scraper was specifically built to go around cloudflare & other web security solutions, which is just genuinely evil

this incident caused a few minutes of api unavailability, but taught me that cobalt (and probably anything else) can no longer exist without active bot/scraping protection

im really glad that cloudflare turnstile exists because i don't know what i'd do without it here

byteplus AS that was spamming requests is 150436 and last seen ip range was 207.166.160.0/21

the amount of unique users on cloudflare analytics rapidly increased by 2.25 times and didn't go down since, while web analytics (plausible) show no increase whatsoever
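
To make the three-request pattern in the thread concrete, here is a rough sketch of how a log entry could be sorted into those stages. Everything in it is an assumption for illustration: the `LoggedRequest` fields, the stage labels, and the user-agent substrings (`Edg/` for Edge, `aria2` for aria2) are not taken from cobalt's code or logs.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LoggedRequest:
    """Hypothetical log record; field names are made up for this sketch."""
    method: str
    user_agent: str
    referer: Optional[str]
    from_byteplus: bool  # e.g. source IP inside the reported BytePlus range


def classify(request: LoggedRequest) -> str:
    """Map a request onto one of the three stages described in the thread."""
    ua = request.user_agent.lower()
    referer = (request.referer or "").lower()

    # Third request: aria2 user agent, no referer, BytePlus IP.
    if request.from_byteplus and "aria2" in ua and not referer:
        return "tunnel-download"
    # Second request: Edge-like user agent, YouTube referer, BytePlus IP.
    if request.from_byteplus and "edg/" in ua and "youtube" in referer:
        return "tunnel-probe"
    # First request: JSON POST from a non-BytePlus (residential) address.
    if request.method == "POST" and not request.from_byteplus:
        return "api-post"
    return "other"
```

The first stage is deliberately loose: on its own, a POST from a residential address looks like any normal user, which is consistent with the thread's point that only the overall pattern stands out.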

  • gnfargbl 5 hours ago

    Sounds like they're using residential proxies for set-up in order to look like normal users, but then switching back to their own ASN for content because residential proxies are expensive.

    > im really glad that cloudflare turnstile exists because i don't know what i'd do without it here

    Why not just blackhole the byteplus ASN?

    • sandworm101 4 hours ago

      And how many of those residential IPs belong to work-from-home bytedance employees running work laptops? Any large company these days has direct access to a pool of innocent residential IPs. The weaponization of that pool may be more evil than the actual scraping imho.

  • miki123211 4 hours ago

    Bytedance seems to have increased its scraping efforts significantly.

    I've posted a canary token[1] URL as a Mastodon post, to check how scrape-resistant Mastodon actually is (it is not resistant at all), and have been getting quite a few hits from the ByteDance spider recently.

    Last hit is from 47.128.114.151, Mozilla/5.0 (Linux; Android 5.0) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; Bytespider; spider-feedback@bytedance.com)

    Edit: added missing footnote.

    [1] https://canarytokens.org

    • diggan 3 hours ago

      > to check how scrape-resistant Mastodon actually is (it is not resistant at all)

      That's expected, no? It's a social network that is explicitly designed to be as open as possible, as it's using ActivityPub. To be "scrape-resistant" would be to go against the very goal of Mastodon.

      • Aachen 3 hours ago

        Exactly, this is how I want it to be. I post there because it's not another walled garden that profits from lock-in