Commit 7ac7a4a

refactor: innertube api refactor
1 parent 5621a46 commit 7ac7a4a

59 files changed

+10665
-39579
lines changed

README.md

Lines changed: 71 additions & 51 deletions
@@ -9,14 +9,14 @@
99

1010
### This library uses an undocumented YouTube API, so it's possible that it will stop working at any time. Use at your own risk.
1111

12-
> **Note:** If you want to use this library on Android platform, refer to
12+
> **Note:** If you want to use this library on an Android platform, refer to
1313
> [Android compatibility](#-android-compatibility).
1414
1515
## 📖 Introduction
1616

1717
Java library which allows you to retrieve subtitles/transcripts for a YouTube video.
1818
It supports manual and automatically generated subtitles, bulk transcript retrieval for all videos in the playlist or
19-
on the channel and does not use headless browser for scraping.
19+
on the channel and does not use a headless browser for scraping.
2020
Inspired by [Python library](https://github.com/jdepoix/youtube-transcript-api).
2121

2222
## ☑️ Features
@@ -60,6 +60,15 @@ implementation 'io.github.thoroldvix:youtube-transcript-api:0.3.6'
6060
implementation("io.github.thoroldvix:youtube-transcript-api:0.3.6")
6161
```
6262

63+
## ❗ IMPORTANT ❗
64+
65+
YouTube has started blocking most IPs that belong to cloud providers (like AWS, Google Cloud Platform, Azure, etc.),
66+
which means you most likely will get access errors when deploying to any cloud solution. It is also possible that
67+
YouTube will block you even if you run it locally; this happens if you make too many requests, mainly when
68+
using [bulk transcript retrieval](#bulk-transcript-retrieval).
69+
To avoid this, you will need to use rotating proxies like [Webshare](https://www.webshare.io/?referral_code=g0ylrg6pzy7f) (referral link) or similar solutions.
70+
You can read about how to make the library use your proxy [here](#youtubeclient-customization-and-proxy).
71+
6372
## 🔰 Getting Started
6473

6574
To start using YouTube Transcript API, you need to create an instance of `YoutubeTranscriptApi` by
@@ -81,15 +90,15 @@ for [finding specific transcripts](#find-transcripts) by language or by type (ma
8190
```java
8291
TranscriptList transcriptList = youtubeTranscriptApi.listTranscripts("videoId");
8392

84-
// Iterate over transcript list
85-
for(Transcript transcript : transcriptList) {
86-
System.out.println(transcript);
93+
// Iterate over a transcript list
94+
for(Transcript transcript : transcriptList){
95+
System.out.println(transcript);
8796
}
8897

8998
// Find transcript in specific language
9099
Transcript transcript = transcriptList.findTranscript("en");
91100

92-
// Find manually created transcript
101+
// Find a manually created transcript
93102
Transcript manualyCreatedTranscript = transcriptList.findManualTranscript("en");
94103

95104
// Find automatically generated transcript
@@ -138,18 +147,19 @@ TranscriptContent transcriptContent = youtubeTranscriptApi.listTranscripts("vide
138147
Given that English is the most common language, you can omit the language code, and it will default to English:
139148

140149
```java
141-
// Retrieve transcript content in english
150+
// Retrieve transcript content in English
142151
TranscriptContent transcriptContent = youtubeTranscriptApi.listTranscripts("videoId")
143-
//no language code defaults to english
144-
.findTranscript()
145-
.fetch();
152+
//no language code defaults to English
153+
.findTranscript()
154+
.fetch();
146155
// Or
147156
TranscriptContent transcriptContent = youtubeTranscriptApi.getTranscript("videoId");
148157
```
149158

150159
For bulk transcript retrieval see [Bulk Transcript Retrieval](#bulk-transcript-retrieval).
151160

152161
## 🤖 Android compatibility
162+
153163
This library uses the Java 11 HttpClient for making YouTube requests by default, so that it depends on a minimal number
154164
of 3rd party libraries. Since the Android SDK doesn't include the Java 11 HttpClient, you will have to implement
155165
your own `YoutubeClient` for it to work.
@@ -160,7 +170,8 @@ You can check how to do it in [YoutubeClient Customization and Proxy](#youtubecl
160170

161171
### Use fallback language
162172

163-
In case if desired language is not available, instead of getting an exception you can pass some other languages that
173+
If the desired language is not available, instead of getting an exception, you can pass some other languages
174+
that
164175
will be used as a fallback.
165176

166177
For example:
@@ -260,15 +271,14 @@ By default, `YoutubeTranscriptApi` uses Java 11 HttpClient for making requests t
260271
different client or use a proxy,
261272
you can create your own YouTube client by implementing the `YoutubeClient` interface.
262273

263-
Here is example implementation using OkHttp:
274+
Here is an example implementation using OkHttp:
264275

265276
```java
266277
public class OkHttpYoutubeClient implements YoutubeClient {
267-
268278
private final OkHttpClient client;
269279

270280
public OkHttpYoutubeClient() {
271-
this.client = new OkHttpClient();
281+
this.client = new OkHttpClient();
272282
}
273283

274284
@Override
@@ -278,67 +288,61 @@ public class OkHttpYoutubeClient implements YoutubeClient {
278288
.url(url)
279289
.build();
280290

281-
return sendGetRequest(request);
291+
return executeRequest(request);
282292
}
283293

284294
@Override
285-
public String get(YtApiV3Endpoint endpoint, Map<String, String> params) throws TranscriptRetrievalException {
295+
public String post(String url, String json) throws TranscriptRetrievalException {
296+
RequestBody requestBody = RequestBody.create(json, MediaType.parse("application/json; charset=utf-8"));
297+
286298
Request request = new Request.Builder()
287-
.url(endpoint.url(params))
299+
.url(url)
300+
.post(requestBody)
288301
.build();
289302

290-
return sendGetRequest(request);
303+
return executeRequest(request);
291304
}
292305

293-
private String sendGetRequest(Request request) throws TranscriptRetrievalException {
306+
private String executeRequest(Request request) throws TranscriptRetrievalException {
294307
try (Response response = client.newCall(request).execute()) {
295308
if (response.isSuccessful()) {
296-
ResponseBody body = response.body();
297-
if (body == null) {
309+
ResponseBody responseBody = response.body();
310+
if (responseBody == null) {
298311
throw new TranscriptRetrievalException("Response body is null");
299312
}
300-
return body.string();
313+
return responseBody.string();
301314
}
302315
} catch (IOException e) {
303-
throw new TranscriptRetrievalException("Failed to retrieve data from YouTube", e);
316+
throw new TranscriptRetrievalException("HTTP request failed", e);
304317
}
305-
throw new TranscriptRetrievalException("Failed to retrieve data from YouTube");
318+
319+
throw new TranscriptRetrievalException("HTTP request failed with non-successful response");
306320
}
307321
}
308322
```
309-
After implementing your custom `YouTubeClient` you will need to pass it to `TranscriptApiFactory` `createWithClient` method.
323+
324+
After implementing your custom `YoutubeClient`, you will need to pass it to the `TranscriptApiFactory` `createWithClient`
325+
method.
310326

311327
```java
312328
YoutubeClient okHttpClient = new OkHttpYoutubeClient();
313329
YoutubeTranscriptApi youtubeTranscriptApi = TranscriptApiFactory.createWithClient(okHttpClient);
314330
```
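Since rotating proxies are the suggested way around IP blocks, a minimal sketch of a proxied HTTP client may help. It uses the JDK's `java.net.http.HttpClient` rather than OkHttp only to stay dependency-free. The class name `ProxiedHttpClientSketch`, the 10-second timeout, and the proxy address are assumptions for illustration, not part of this library. A custom `YoutubeClient` could use a client built this way inside its `get` and `post` methods.

```java
import java.net.InetSocketAddress;
import java.net.ProxySelector;
import java.net.http.HttpClient;
import java.time.Duration;

// Hypothetical helper, not part of the library: builds a JDK HttpClient
// that routes every request through a single HTTP proxy.
final class ProxiedHttpClientSketch {

    static HttpClient build(String proxyHost, int proxyPort) {
        return HttpClient.newBuilder()
                // Send all requests through the given proxy.
                .proxy(ProxySelector.of(new InetSocketAddress(proxyHost, proxyPort)))
                // Assumed timeout; tune it for your proxy provider.
                .connectTimeout(Duration.ofSeconds(10))
                .build();
    }

    public static void main(String[] args) {
        // Placeholder address; replace with your proxy's host and port.
        HttpClient client = build("127.0.0.1", 8080);
        System.out.println("Proxy configured: " + client.proxy().isPresent());
    }
}
```

A rotating-proxy provider usually exposes one gateway host and port, so a single `ProxySelector` is often enough; proxies that require authentication would additionally need an `Authenticator` on the builder.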
315331

316332
### Cookies
317-
318-
Some videos may be age-restricted, requiring authentication to access the transcript.
319-
To achieve this, obtain access to the desired video in a browser and download the cookies in Netscape format, storing
320-
them as a TXT file.
321-
You can use extensions
322-
like [Get cookies.txt LOCALLY](https://chromewebstore.google.com/detail/get-cookiestxt-locally/cclelndahbckbenkjhflpdbgdldlbecc)
323-
for Chrome or [cookies.txt](https://addons.mozilla.org/en-US/firefox/addon/cookies-txt/) for Firefox to do this.
324-
`YoutubeTranscriptApi` contains `listTranscriptsWithCookies` and `getTranscriptWithCookies` which accept a path to the
325-
cookies.txt file.
326-
327-
```java
328-
// Retrieve transcript list
329-
TranscriptList transcriptList = youtubeTranscriptApi.listTranscriptsWithCookies("videoId", "path/to/cookies.txt");
330-
331-
// Get transcript content
332-
TranscriptContent transcriptContent = youtubeTranscriptApi.getTranscriptWithCookies("videoId", "path/to/cookies.txt", "en");
333-
```
333+
Some videos are age-restricted, so this library won't be able to access those videos without some sort of authentication.
334+
Unfortunately, some recent changes to the YouTube API have broken the current implementation of cookie-based
335+
authentication, so this feature is currently not available.
334336

335337
### Bulk Transcript Retrieval
336338

337-
There are a few methods for bulk transcript retrieval in `YoutubeTranscriptApi`
339+
#### ❗You will most likely get [IP blocked](#-important-) by YouTube if you use this❗
340+
341+
There are a few methods for bulk transcript retrieval in `YoutubeTranscriptApi`
338342

339-
Playlists and channels information is retrieved from
343+
Playlist and channel information is retrieved from
340344
the [YouTube V3 API](https://developers.google.com/youtube/v3/docs/),
341-
so you will need to provide API key for all methods.
345+
so you will need to provide an API key for all methods.
342346

343347
All methods take a `TranscriptRequest` object as a parameter,
344348
which contains the following fields:
@@ -348,8 +352,6 @@ which contains the following fields:
348352
fail fast by throwing an error if one of the transcripts could not be retrieved,
349353
otherwise it will ignore failed transcripts.
350354

351-
- `cookies` (optional) - Path to [cookies.txt](#cookies) file.
352-
353355
All methods return a map which contains the video ID as a key and the corresponding result as a value.
354356

355357
```java
@@ -426,10 +428,28 @@ undocumented API URL embedded within its HTML. This JSON looks like this:
426428
}
427429
```
428430

429-
This library works by making a single GET request to the YouTube page of the specified video, extracting the JSON data
430-
from the HTML, and parsing it to obtain a list of all available transcripts. To fetch the transcript content, it then
431-
sends a GET request to the API URL extracted from the JSON. The YouTube API returns the transcript content in XML
432-
format, like this:
431+
Previously, you could extract this JSON directly from the video page HTML and call the API URL embedded in it, but YouTube
432+
no longer allows
433+
requests to that URL.
434+
There is a workaround, however. Each video page also contains an `INNERTUBE_API_KEY` field, which can be used to access
435+
the internal YouTube API. With this key, you can make a POST request to
436+
`https://www.youtube.com/youtubei/v1/player?key=INNERTUBE_API_KEY` with a body like this:
437+
438+
```json
439+
{
440+
"context": {
441+
"client": {
442+
"clientName": "ANDROID",
443+
"clientVersion": "20.10.38"
444+
}
445+
},
446+
"videoId": "dQw4w9WgXcQ"
447+
}
448+
```
449+
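The POST request described above can be sketched with the JDK's `java.net.http` API. This is an illustration under stated assumptions, not necessarily how the library builds the request: the class and method names are hypothetical, the video ID is assumed to contain only characters that are safe inside a JSON string, and the sketch only constructs the request without sending it.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: builds the Innertube player request described in the README.
final class InnertubeRequestSketch {

    // Endpoint taken from the README; the key is the INNERTUBE_API_KEY found in the video page.
    static String playerUrl(String apiKey) {
        return "https://www.youtube.com/youtubei/v1/player?key="
                + URLEncoder.encode(apiKey, StandardCharsets.UTF_8);
    }

    // Compact form of the JSON body shown in the README.
    // Assumes videoId needs no JSON escaping.
    static String playerBody(String videoId) {
        return "{\"context\":{\"client\":{\"clientName\":\"ANDROID\",\"clientVersion\":\"20.10.38\"}},"
                + "\"videoId\":\"" + videoId + "\"}";
    }

    static HttpRequest build(String apiKey, String videoId) {
        return HttpRequest.newBuilder(URI.create(playerUrl(apiKey)))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(playerBody(videoId)))
                .build();
    }
}
```

Sending the request is then a matter of passing it to an `HttpClient`, for example `client.send(request, HttpResponse.BodyHandlers.ofString())`.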
450+
This request returns JSON similar to the JSON contained in the video page HTML. The API URL extracted from it is then
451+
called to retrieve the content of the transcript,
452+
which is in XML format and looks like this:
433453

434454
```xml
435455
<?xml version="1.0" encoding="utf-8" ?>

gradle/libs.versions.toml

Lines changed: 2 additions & 0 deletions
@@ -6,6 +6,7 @@ jackson = "2.17.2"
 apache-commons-text = "1.12.0"
 maven-publish = "0.29.0"
 gradle-release = "3.0.2"
+jspecify = "1.0.0"
 
 [libraries]
 junit-jupiter = { module = "org.junit.jupiter:junit-jupiter", version.ref = "junit" }
@@ -14,6 +15,7 @@ assertj-core = { module = "org.assertj:assertj-core", version.ref = "assertj" }
 mockito-junit-jupiter = { module = "org.mockito:mockito-junit-jupiter", version.ref = "mockito" }
 jackson-dataformat-xml = { module = "com.fasterxml.jackson.dataformat:jackson-dataformat-xml", version.ref = "jackson" }
 apache-commons-text = { module = "org.apache.commons:commons-text", version.ref = "apache-commons-text" }
+jspecify = { module = "org.jspecify:jspecify", version.ref = "jspecify"}
 
 [plugins]
 maven-publish = { id = "com.vanniktech.maven.publish", version.ref = "maven-publish" }

lib/build.gradle.kts

Lines changed: 3 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -6,8 +6,8 @@ object Metadata {
     const val GROUP_ID = "io.github.thoroldvix"
     const val LICENSE = "MIT"
     const val LICENSE_URL = "https://opensource.org/licenses/MIT"
-    const val GITHUB_REPO = "thoroldvix/youtube-transcript-api"
-    const val DEVELOPER_ID = "thoroldvix"
+    const val GITHUB_REPO = "trldvix/youtube-transcript-api"
+    const val DEVELOPER_ID = "trldvix"
     const val DEVELOPER_NAME = "Alexey Bobkov"
     const val DEVELOPER_EMAIL = "dignitionn@gmail.com"
 }
@@ -34,6 +34,7 @@ tasks.getByName<Test>("test") {
 dependencies {
     implementation(libs.jackson.dataformat.xml)
     implementation(libs.apache.commons.text)
+    implementation(libs.jspecify)
 
     testRuntimeOnly(libs.junit.jupiter.platform.launcher)
     testImplementation(libs.junit.jupiter)
Lines changed: 41 additions & 0 deletions
@@ -0,0 +1,41 @@
+package io.github.thoroldvix.api;
+
+/**
+ * Request object for retrieving transcripts.
+ * <p>
+ * Contains an API key required for the YouTube V3 API,
+ * which is used to retrieve playlist and channel information.
+ * Also contains a flag to stop on error or continue on error. Defaults to true if not provided.
+ * </p>
+ */
+public class BulkTranscriptRequest {
+    private final String apiKey;
+    private final boolean stopOnError;
+
+    public BulkTranscriptRequest(String apiKey, boolean stopOnError) {
+        if (apiKey.isBlank()) {
+            throw new IllegalArgumentException("API key cannot be empty");
+        }
+        this.apiKey = apiKey;
+        this.stopOnError = stopOnError;
+    }
+
+    public BulkTranscriptRequest(String apiKey) {
+        this(apiKey, true);
+    }
+
+    /**
+     * @return API key for the YouTube V3 API (see <a href="https://developers.google.com/youtube/v3/getting-started">Getting started</a>)
+     */
+    public String getApiKey() {
+        return apiKey;
+    }
+
+    /**
+     * @return Whether to stop if transcript retrieval fails for a video. If false, all transcripts that could not be retrieved will be skipped,
+     * otherwise an exception will be thrown on the first error.
+     */
+    public boolean isStopOnError() {
+        return stopOnError;
+    }
+}
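
The `stopOnError` behavior documented above (fail on the first error, or skip failed videos and keep going) can be illustrated with a small standalone sketch. It does not use the library's real retrieval code: `StopOnErrorSketch`, `fetchAll`, and the `fetcher` function are hypothetical stand-ins, and a `RuntimeException` stands in for whatever exception the library actually throws.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical illustration of the stopOnError semantics; not the library's implementation.
final class StopOnErrorSketch {

    // Returns a map of video ID to result, mirroring the shape the README describes.
    static Map<String, String> fetchAll(List<String> videoIds,
                                        Function<String, String> fetcher,
                                        boolean stopOnError) {
        Map<String, String> results = new LinkedHashMap<>();
        for (String videoId : videoIds) {
            try {
                results.put(videoId, fetcher.apply(videoId));
            } catch (RuntimeException e) {
                if (stopOnError) {
                    throw e; // fail fast on the first error
                }
                // otherwise skip this video and continue with the rest
            }
        }
        return results;
    }
}
```

With `stopOnError` set to `false`, a list in which one video fails still yields results for the others; with `true`, the first failure propagates to the caller.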
