ByteDance is abusing the free video downloading service Cobalt for mass scraping

138 points by jsheard a year ago

Tiberium a year ago

Important to note that the author assumes that this is ByteDance, but the ASN belongs to their cloud solution BytePlus, which could be used by other companies.

https://x.com/sauceo_/status/1842866301066518875

https://www.byteplus.com/en

edouard-harris a year ago

The author does address this possibility in a reply:
> it's very unlikely to be someone else because pricing is astronomical. you also have to "contact sales" to get access to anything outside of a free trial. no one would pay that much for a block of ips with terrible reputation
https://x.com/uwukko/status/1842866807763308615
- rfoo a year ago
  
  > pricing is astronomical. you also have to "contact sales" to get access to anything outside of a free trial
  You don't have to contact sales if you are a Chinese-speaking customer. And pricing is fine. ByteDance has a different brand for their cloud services in China: https://www.volcengine.com/ [1]. But of course the underlying infrastructure is all the same.
  This is very likely done by a Chinese customer using ByteDance's cloud service.
  [1] Well, Alibaba Cloud did this too, and ByteDance is copying Alibaba 1:1 (who in turn is copying AWS) so I'm not surprised. But at least Alibaba named their international brand "Alibaba Cloud" and their CN one "AliCloud", similar enough.
- Tiberium a year ago
  
  Yes, but this is also pure speculation, since the product clearly exists, has customers, and even has a free trial.
teractiveodular a year ago

This. Bytedance's official spider has a clear User-Agent tagged Bytespider, but OP didn't mention what they're seeing.
- jsheard a year ago
  
  This isn't spider traffic though; the traffic pattern indicates that it's a special-purpose bot designed to hit Cobalt's internal API in particular. A generic spider probably wouldn't even be able to find the API endpoints that are only referenced by JavaScript, never mind consistently hit the API with a valid video URL from a residential proxy and then switch to a different IP address to download the result every time.

Thomashuet a year ago

Short version: a service known for evading YouTube's bot protection is complaining that ByteDance is bypassing their own protections. I agree that it's not nice of ByteDance, but I find it hypocritical of Cobalt to call it evil.

lunarmony a year ago

> cobalt was created for public benefit, to protect people from ads and malware pushed by its alternatives
can't say the same for bytedance, which is designed to exploit users with various ads
- whywhywhywhy a year ago
  
  It was created for donation money, let's not do mental gymnastics to justify one type of scraping and vilify another. Scraping is scraping and it's either all fair game or it's not all fair game.
- appendix-rock a year ago
  
  I feel like you’re missing the point on purpose? Cobalt is asserting that it’s doing good based on the shadier behaviour of its competitors. But can you justify Cobalt in isolation any more than you can justify whoever was scraping it?
  - HeatrayEnjoyer a year ago
    
    Yes.
  - dangsux a year ago
    
    [dead]
h4x0rr a year ago

You can't compare that... cobalt doesn't DDOS YouTube
- jsheard a year ago
  
  Cobalt is also completely free, without ads or any other monetization besides donations, it's purely meant to help normal people download videos for normal people purposes. It's not like they're a for-profit data harvesting outfit complaining about getting abused by another for-profit data harvesting outfit.
  - Thomashuet a year ago
    
    You're just saying that Cobalt is small and non-profit so they must be good and YouTube and ByteDance are big and rich so they must be evil. But if you only look at what they are actually doing here, it's very similar: bypassing protections to use a service in a way that the service provider doesn't like.
    
    phoronixrly a year ago
    
    Bytedance and YouTube are evil, but not because they are big and rich. Cobalt is good, but not because they are small and a non-profit.
    
    loloquwowndueo a year ago
    
    If bytedance are so big and rich, why don’t they implement their own scraping solution instead of abusing a small service like cobalt?
    
    sangnoir a year ago
    
    ...Because someone scraping from a Bytedance IP range is not necessarily Bytedance, just like requests from an AWS IP do not imply Amazon authored the spider.
    
    skeaker a year ago
    
    In isolation, a thief masquerading as a security system technician and an actual technician both do good work by checking on your home security. You can't meaningfully say one is better than the other, because even though one is secretly casing out your home so he can rob it later, in isolation they're doing the same thing.
    
    snvy a year ago
    
    Cobalt is bypassing protections to allow legitimate YouTube users to download single videos without causing harm and with no monetary incentives. Bytedance is mass downloading thousands of videos, all for monetary incentives, while heavily breaking the TOS and potentially ignoring copyright laws. Similar, but one is doing way more harm than the other.
    
    whywhywhywhy a year ago
    
    > and with no monetary incentives
    Donations are a monetary incentive
    > while heavily breaking the TOS and potentially ignoring copyright laws
    Cobalt also breaks the TOS and ignores copyright laws. Personally I don't think that matters, but having a double standard, where "it's ok when they do it" applies to one company and copyright laws and the TOS become a weapon against one you don't like, just makes me think it really isn't about TOS or copyright, is it?
    It also just gives YouTube ammunition to impose stricter protection against smaller violators like cobalt, or people running yt-dlp themselves.
    
    rileyboi a year ago
    
    [dead]
- criddell a year ago
  
  Cobalt didn’t say the DDOS was evil, they said:
  “bytedance's scraper was specifically built to go around cloudflare & other web security solutions, which is just genuinely evil”
  So I would say it’s a fair comparison.
  - dewey a year ago
    
    > built to go around cloudflare
    Then they either didn't set up CF correctly or they just use the mode in most headless browsers that bypasses default CF protection when CF is not in attack mode.
    
    rileyboi a year ago
    
    [dead]
afavour a year ago

I don't see the hypocrisy here. Cobalt is a small, free service that results in Google (or so the argument goes) making less profit. ByteDance are a giant money printing machine using that free service for their own ends. They have more than enough resources to not abuse a free one.
- squeaky-clean a year ago
  
  Let's say hypothetically Cobalt was made by ByteDance as a way to scrape youtube and have a scapegoat. Is it still okay?
  If your opinion changes because the owner is different, even though the service stays the same, that's hypocritical.
  - afavour a year ago
    
    Of course it isn't hypocritical. It's like the old story of the poor man stealing bread to feed his starving family, of course circumstances matter. It's silly to suggest otherwise.
    
    squeaky-clean a year ago
    
    Using Cobalt doesn't do anything to feed their family. It's an old poor man stealing a blu-ray DVD vs a rich young man stealing a blu-ray DVD.
- dangsux a year ago
  
  [dead]
ascpixi a year ago

Apples to oranges - abusing an undocumented API of a foreign service to mass-scrape another one by proxy is not the same as sending singular, user-created requests.

conradfr a year ago

Some time ago I noticed the ByteDance spider very aggressively scraping my modest side project and, more importantly, modest server.

I wrote to them to please stop (I think the address was in the user agent or something), they replied sorry and actually stopped.

Not sure why all these crawlers can't pace themselves.

throwaway98797 a year ago

devs are promoted on how fast they get done
faster, bigger, MOAR
sometimes it’s hard to have nice things

3np a year ago

Interesting timing. The last ~month or so we've seen a drastic shift in YouTube availability. Stricter enforcement of authentication tokens (including breaking some legacy clients) and IP blocking. Loads of Invidious instances either shut down or not able to serve videos anymore. yt-dlp not working at all over an increasing number of VPNs and proxies.

Maybe this is some ByteDance engineers getting really desperate and resorting to abusing every youtube proxy service they can because apparently they do have a residential proxy network which doesn't cut it anymore?

Unless it's just a cost-optimization measure (residential proxy traffic is relatively pricey).

A4ET8a8uTh0 a year ago

Yeah, I noticed this as well. I think the window of what some might remember as old youtube is closing forever sooner than anticipated. As I may have suggested on this forum before, if you have anything in particular you want to archive, you would be wise to have a plan to do it sooner rather than later. Space is cheap enough and I assume most people won't want to archive the entire net ( I know data hoarders exist and god bless them, but I assume they will be ok ).
- Wowfunhappy a year ago
  
  Short of full-on using Widevine/eme for all videos (which I assume would lock out too many devices), how much more could Youtube do? As long as the data is being streamed to your computer, there will be a way to capture it, right?
  - A4ET8a8uTh0 a year ago
    
    A qualified yes is probably in order ( and a more knowledgeable person can likely chime in if I misstate something ). It is and always has been a cat and mouse game, not completely unlike game or movie piracy. As you stated, if you can see it on your PC, there is likely a means to capture it.
    Still, notice how most of the low effort avenues are slowly being cut off one by one. I will use non youtube example. Not that long ago, I was able to rip blurays using off the shelf external bluray writer, but new firmware on currently sold drives remove that ability.
    Now, Google typically won't be ( and isn't ) everyone's hardware provider, but there are ways they could degrade 'non-sanctioned' experience in browser they can ( and do ) control.
    Granted, in Firefox ( and other non-google browsers ) it may not be as simple, but the future there is not as straightforward either, given Mozilla's trajectory and financial dependence ( and moves ).
    In short, I agree with you but note that initially it was genuinely trivial to download youtube videos. This has changed over the years.
  - 3np a year ago
    
    I can very much imagine a site-wide Widevine/EME requirement for anything better than 480p and crusty audio not that far into the future.
    That's already the case for some (anecdotally increasing) number of videos.
    
    HeatrayEnjoyer a year ago
    
    Which YouTube videos require that?
  - treyd a year ago
    
    This would encourage a lot more people to want to break Widevine. :)

xbmcuser a year ago

I think Chinese ISPs can't store some data, as they might get in trouble with Chinese censors, so they don't cache it. Then if something gets slightly viral, you see huge traffic from one IP that might be a VPN. On Reddit's torrent channels you get similar questions when, ahem, a Linux ISO is downloaded thousands of times from the same IP.

lithiumii a year ago

That could be a completely different problem. In China many people run PCDN (p2p CDN) for profit. The ISPs detect (and ban) such PCDN nodes by checking the ratio of uploaded to downloaded traffic. To balance this ratio and thus avoid being detected, these people download popular torrents again and again without uploading at all.

horsebridge a year ago

Anybody running a site with data that is useful for AI will learn how horrible bytedance is.

HeralFacker a year ago

Blackhole Bytedance's ASNs. Cobalt is an end-user tool, so there's not much legitimacy to a cloud service accessing it.

seanhunter a year ago

A few big sites that I'm familiar with have seen ByteDance become by far the most aggressive scraper in their logs over the last six months.

lawrenceyan a year ago

Who else just found out Cobalt exists from this post? Wow, this is lit.

FatalLogic a year ago

>i can safely assume that bytedance was scraping youtube videos by abusing our private api

I'm not doubting the OP. But why is ByteDance doing this? What does that company get out of scraping YouTube?

jsheard a year ago

Like every other tech giant they're in the AI arms race, and in particular they are building video generation models. It's probably safe to assume they are trying to grab as much of YouTube as possible for use as training material.
https://decrypt.co/284353/tiktok-maker-powerful-ai-video-gen...
sunaookami a year ago

Nothing because the OP is lying. It's not Bytedance, it's Byteplus, their cloud product. Think of it as AWS.

sergiotapia a year ago

why would they use cobalt instead of yt-dlp? is it to mask their originating IPs and such?

ulrischa a year ago

ByteDance is also massively scraping official governmental sites with strange URL patterns.

jsheard a year ago

@uwukko's full thread for those who don't have a Twitter account:

earlier today i noticed very elevated traffic to cobalt api that looked a lot like ddos. it turned out to be bytedance!

we can't tell what videos they were downloading or where the original request comes from as it's built to go around all limiters, but there's still a pattern

first request: json post with content url & settings from a residential proxy

second request: tunnel with pseudo microsoft edge on windows user agent & youtube origin/referer, from byteplus ip

third request: same tunnel with aria2 user agent & no referer, also from byteplus ip
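
(A rough sketch of how the three-step pattern described above could be matched in access logs. The `Request` record and its field names are hypothetical and are not Cobalt's real log schema; the matching rules only mirror what the thread states, and the first stage on its own is indistinguishable from an ordinary user, so only the full sequence is treated as a match.)

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Request:
    """Hypothetical, simplified access-log record (assumed fields)."""
    method: str
    is_tunnel: bool            # True if the request hit the tunnel endpoint
    user_agent: str
    referer: Optional[str]
    from_byteplus: bool        # True if the source IP is in a BytePlus range


def classify(req: Request) -> Optional[str]:
    """Return the stage of the described pattern a request resembles, if any."""
    ua = req.user_agent.lower()
    if req.method == "POST" and not req.is_tunnel and not req.from_byteplus:
        # Stage 1: JSON POST from a non-BytePlus (e.g. residential) address.
        return "stage1-json-post"
    if req.is_tunnel and req.from_byteplus:
        if "edg/" in ua and req.referer and "youtube.com" in req.referer:
            # Stage 2: Edge-looking user agent with a YouTube referer.
            return "stage2-edge-tunnel"
        if ua.startswith("aria2") and not req.referer:
            # Stage 3: aria2 user agent with no referer.
            return "stage3-aria2-tunnel"
    return None


def looks_like_pattern(requests: list[Request]) -> bool:
    """True only if the requests form the full three-stage sequence in order."""
    expected = ["stage1-json-post", "stage2-edge-tunnel", "stage3-aria2-tunnel"]
    return [classify(r) for r in requests] == expected
```
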

cobalt is a media downloader, mostly known for supporting youtube even at worst times. cobalt's tunnel is either a proxy stream or ffmpeg live render

considering all of this, i can safely assume that bytedance was scraping youtube videos by abusing our private api

with release of v10 we implemented cloudflare turnstile, but later disabled it due to access issues by a chunk of our users

enabling it back brought the server load to normal levels and stopped bytedance from choking our servers cuz they didn't account for this (yet)

before resorting to turnstile, i attempted using other cloudflare services, but none of them seemed to help much

my theory is that bytedance's scraper was specifically built to go around cloudflare & other web security solutions, which is just genuinely evil

this incident caused a few minutes of api unavailability, but taught me that cobalt (and probably anything else) can no longer exist without active bot/scraping protection

im really glad that cloudflare turnstile exists because i don't know what i'd do without it here

byteplus AS that was spamming requests is 150436 and last seen ip range was 207.166.160.0/21

the amount of unique users on cloudflare analytics rapidly increased by 2.25 times and didn't go down since, while web analytics (plausible) show no increase whatsoever
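
For anyone who wants to check their own logs against the range quoted in the thread, here is a minimal sketch using Python's standard `ipaddress` module. The function name is made up for illustration, and the range is only the "last seen" one reported above, so it may be incomplete or out of date.

```python
import ipaddress

# Range quoted in the thread for BytePlus AS150436 ("last seen", may change).
REPORTED_RANGE = ipaddress.ip_network("207.166.160.0/21")


def in_reported_range(ip: str) -> bool:
    """Return True if the address falls inside the reported /21."""
    return ipaddress.ip_address(ip) in REPORTED_RANGE
```

A /21 covers 207.166.160.0 through 207.166.167.255, so `in_reported_range("207.166.163.10")` is true while `in_reported_range("207.166.168.1")` is not.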

gnfargbl a year ago

Sounds like they're using residential proxies for set-up in order to look like normal users, but then switching back to their own ASN for content because residential proxies are expensive.
> im really glad that cloudflare turnstile exists because i don't know what i'd do without it here
Why not just blackhole the byteplus ASN?
- sandworm101 a year ago
  
  And how many of those residential IPs belong to work-from-home bytedance employees running work laptops? Any large company these days has direct access to a pool of innocent residential IPs. The weaponization of that pool may be more evil than the actual scraping imho.
miki123211 a year ago

Bytedance seems to have increased its scraping efforts significantly.
I've posted a canary token[1] URL as a Mastodon post, to check how scrape-resistant Mastodon actually is (it is not resistant at all), and have been getting quite a few hits from the ByteDance spider recently.
Last hit is from 47.128.114.151, Mozilla/5.0 (Linux; Android 5.0) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; Bytespider; spider-feedback@bytedance.com)
Edit: added missing footnote.
[1] https://canarytokens.org
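
A small sketch of how a hit like the one above could be picked out of logs by its User-Agent. The regex and function name are illustrative assumptions based only on the single string quoted in this comment; the header is self-reported, so this catches only crawlers that announce themselves.

```python
import re
from typing import Optional

# Matches the "(compatible; Bytespider; <contact>)" token seen in the quoted UA.
BYTESPIDER_RE = re.compile(r"\(compatible;\s*Bytespider;\s*([^)\s]+)\)")


def bytespider_contact(user_agent: str) -> Optional[str]:
    """Return the contact address embedded in a Bytespider UA, else None."""
    match = BYTESPIDER_RE.search(user_agent)
    return match.group(1) if match else None
```
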
- diggan a year ago
  
  > to check how scrape-resistant Mastodon actually is (it is not resistant at all)
  That's expected, no? It's a social network that is explicitly designed to be as open as possible, as it's using ActivityPub. To be "scrape-resistant" would be to go against the very goal of Mastodon.
  - miki123211 a year ago
    
    Yes and no.
    If you look at the technical side of things, you're absolutely right. If you look at the social side, however, there's a lot of talk on there about opting out of scraping, scrapers being bad, not wanting to be part of AI training and so on. Naming-and-shaming people who have been caught scraping is a routine practice.
    I think that many Mastodonians believe that defederating from scraper-friendly instances and blocking scraper-like requests on their own protects them; this was a way to show that this very much isn't true.
  - Aachen a year ago
    
    Exactly, this is how I want it to be. I post there because it's not another walled garden that profits from lock-in