### This library uses undocumented YouTube API, so it's possible that it will stop working at any time. Use at your own risk.

- > **Note:** If you want to use this library on Android platform, refer to
+ > **Note:** If you want to use this library on an Android platform, refer to
> [Android compatibility](#-android-compatibility).

## 📖 Introduction

Java library which allows you to retrieve subtitles/transcripts for a YouTube video.
It supports manual and automatically generated subtitles, bulk transcript retrieval for all videos in the playlist or
- on the channel and does not use headless browser for scraping.
+ on the channel and does not use a headless browser for scraping.
Inspired by [Python library](https://github.com/jdepoix/youtube-transcript-api).

## ☑️ Features
@@ -60,6 +60,15 @@ implementation 'io.github.thoroldvix:youtube-transcript-api:0.3.6'
implementation("io.github.thoroldvix:youtube-transcript-api:0.3.6")
```

+ ## ❗ IMPORTANT ❗
+
+ YouTube has started blocking most IPs that belong to cloud providers (like AWS, Google Cloud Platform, Azure, etc.),
+ which means you most likely will get access errors when deploying to any cloud solution. It is also possible that
+ YouTube will block you even if you run it locally, it will happen if you make too many requests, mainly when
+ using [bulk transcript retrieval](#bulk-transcript-retrieval).
+ To avoid this, you will need to use rotating proxies like [Webshare](https://www.webshare.io/?referral_code=g0ylrg6pzy7f) (referral link) or similar solutions.
+ You can read on how to make a library use your proxy [here](#youtubeclient-customization-and-proxy).
+
## 🔰 Getting Started

To start using YouTube Transcript API, you need to create an instance of `YoutubeTranscriptApi` by
@@ -81,15 +90,15 @@ for [finding specific transcripts](#find-transcripts) by language or by type (ma
```java
TranscriptList transcriptList = youtubeTranscriptApi.listTranscripts("videoId");

- // Iterate over transcript list
- for (Transcript transcript : transcriptList) {
-     System.out.println(transcript);
+ // Iterate over a transcript list
+ for (Transcript transcript : transcriptList){
+     System.out.println(transcript);
}

// Find transcript in specific language
Transcript transcript = transcriptList.findTranscript("en");

- // Find manually created transcript
+ // Find a manually created transcript
Transcript manualyCreatedTranscript = transcriptList.findManualTranscript("en");

// Find automatically generated transcript
@@ -138,18 +147,19 @@ TranscriptContent transcriptContent = youtubeTranscriptApi.listTranscripts("vide
Given that English is the most common language, you can omit the language code, and it will default to English:

```java
- // Retrieve transcript content in english
+ // Retrieve transcript content in English
TranscriptContent transcriptContent = youtubeTranscriptApi.listTranscripts("videoId")
-         // no language code defaults to english
-         .findTranscript()
-         .fetch();
+         // no language code defaults to English
+         .findTranscript()
+         .fetch();

// Or
TranscriptContent transcriptContent = youtubeTranscriptApi.getTranscript("videoId");
```

For bulk transcript retrieval see [Bulk Transcript Retrieval](#bulk-transcript-retrieval).

## 🤖 Android compatibility
+
This library uses Java 11 HttpClient for making YouTube requests by default, it was done so it depends on minimal amount
of 3rd party libraries. Since Android SDK doesn't include Java 11 HttpClient, you will have to implement
your own `YoutubeClient` for it to work.
@@ -160,7 +170,8 @@ You can check how to do it in [YoutubeClient Customization and Proxy](#youtubecl

### Use fallback language

- In case if desired language is not available, instead of getting an exception you can pass some other languages that
+ In case if the desired language is not available, instead of getting an exception, you can pass some other languages
+ that
will be used as a fallback.

For example:
@@ -260,15 +271,14 @@ By default, `YoutubeTranscriptApi` uses Java 11 HttpClient for making requests t
different client or use a proxy,
you can create your own YouTube client by implementing the `YoutubeClient` interface.

- Here is example implementation using OkHttp:
+ Here is an example implementation using OkHttp:

```java
public class OkHttpYoutubeClient implements YoutubeClient {
-
    private final OkHttpClient client;

    public OkHttpYoutubeClient() {
-         this.client = new OkHttpClient();
+         this.client = new OkHttpClient();
    }

    @Override
@@ -278,67 +288,61 @@ public class OkHttpYoutubeClient implements YoutubeClient {
                .url(url)
                .build();

-         return sendGetRequest(request);
+         return executeRequest(request);
    }

    @Override
-     public String get(YtApiV3Endpoint endpoint, Map<String, String> params) throws TranscriptRetrievalException {
+     public String post(String url, String json) throws TranscriptRetrievalException {
+         RequestBody requestBody = RequestBody.create(json, MediaType.parse("application/json; charset=utf-8"));
+
        Request request = new Request.Builder()
-                 .url(endpoint.url(params))
+                 .url(url)
+                 .post(requestBody)
                .build();

-         return sendGetRequest(request);
+         return executeRequest(request);
    }

-     private String sendGetRequest(Request request) throws TranscriptRetrievalException {
+     private String executeRequest(Request request) throws TranscriptRetrievalException {
        try (Response response = client.newCall(request).execute()) {
            if (response.isSuccessful()) {
-                 ResponseBody body = response.body();
-                 if (body == null) {
+                 ResponseBody responseBody = response.body();
+                 if (responseBody == null) {
                    throw new TranscriptRetrievalException("Response body is null");
                }
-                 return body.string();
+                 return responseBody.string();
            }
        } catch (IOException e) {
-             throw new TranscriptRetrievalException("Failed to retrieve data from YouTube", e);
+             throw new TranscriptRetrievalException("HTTP request failed", e);
        }
-         throw new TranscriptRetrievalException("Failed to retrieve data from YouTube");
+
+         throw new TranscriptRetrievalException("HTTP request failed with non-successful response");
    }
}
```
- After implementing your custom `YouTubeClient` you will need to pass it to `TranscriptApiFactory` `createWithClient` method.
+
+ After implementing your custom `YouTubeClient` you will need to pass it to `TranscriptApiFactory` `createWithClient`
+ method.

```java
YoutubeClient okHttpClient = new OkHttpYoutubeClient();
YoutubeTranscriptApi youtubeTranscriptApi = TranscriptApiFactory.createWithClient(okHttpClient);
```
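
If OkHttp is not an option, a similar client that sends its requests through a proxy can be sketched with the Java 11 `HttpClient`. This is only an illustration under stated assumptions: the class `ProxiedHttp`, its method names, and the proxy host and port are hypothetical and not part of the library. Adapting it to the `YoutubeClient` interface (and to `TranscriptRetrievalException`) is left out here, since those types come from the library itself.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.ProxySelector;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical helper (not a library class): GET/POST over a fixed HTTP proxy.
final class ProxiedHttp {
    private final HttpClient client;

    ProxiedHttp(String proxyHost, int proxyPort) {
        // Every request sent by this client is routed through the given proxy.
        this.client = HttpClient.newBuilder()
                .proxy(ProxySelector.of(new InetSocketAddress(proxyHost, proxyPort)))
                .build();
    }

    HttpClient client() {
        return client;
    }

    // Builds a plain GET request for the given URL.
    static HttpRequest buildGet(String url) {
        return HttpRequest.newBuilder(URI.create(url)).GET().build();
    }

    // Builds a POST request carrying a JSON body.
    static HttpRequest buildPost(String url, String json) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("Content-Type", "application/json; charset=utf-8")
                .POST(HttpRequest.BodyPublishers.ofString(json))
                .build();
    }

    // Sends the request and returns the body, failing on non-2xx responses.
    String execute(HttpRequest request) throws IOException, InterruptedException {
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() / 100 != 2) {
            throw new IOException("HTTP request failed with status " + response.statusCode());
        }
        return response.body();
    }
}
```

A custom `YoutubeClient` could delegate its `get` and `post` methods to a helper like this and translate `IOException` into the library's own exception type.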

### Cookies
-
- Some videos may be age-restricted, requiring authentication to access the transcript.
- To achieve this, obtain access to the desired video in a browser and download the cookies in Netscape format, storing
- them as a TXT file.
- You can use extensions
- like [Get cookies.txt LOCALLY](https://chromewebstore.google.com/detail/get-cookiestxt-locally/cclelndahbckbenkjhflpdbgdldlbecc)
- for Chrome or [cookies.txt](https://addons.mozilla.org/en-US/firefox/addon/cookies-txt/) for Firefox to do this.
- `YoutubeTranscriptApi` contains `listTranscriptsWithCookies` and `getTranscriptWithCookies` which accept a path to the
- cookies.txt file.
-
- ```java
- // Retrieve transcript list
- TranscriptList transcriptList = youtubeTranscriptApi.listTranscriptsWithCookies("videoId", "path/to/cookies.txt");
-
- // Get transcript content
- TranscriptContent transcriptContent = youtubeTranscriptApi.getTranscriptWithCookies("videoId", "path/to/cookies.txt", "en");
- ```
+ Some videos are age-restricted, so this library won't be able to access those videos without some sort of authentication.
+ Unfortunately, some recent changes to the YouTube API have broken the current implementation of cookie-based
+ authentication, so this feature is currently not available.

### Bulk Transcript Retrieval

- There are a few methods for bulk transcript retrieval in `YoutubeTranscriptApi`
+ #### ❗You will most likely get [IP blocked](#-important-) by YouTube if you use this❗
+
+ There are a few methods for bulk transcript retrieval in `YoutubeTranscriptApi`

- Playlists and channels information is retrieved from
+ Playlists and channels information are retrieved from
the [YouTube V3 API](https://developers.google.com/youtube/v3/docs/),
- so you will need to provide API key for all methods.
+ so you will need to provide an API key for all methods.

All methods take a `TranscriptRequest` object as a parameter,
which contains the following fields:
@@ -348,8 +352,6 @@ which contains the following fields:
fail fast by throwing an error if one of the transcripts could not be retrieved,
otherwise it will ignore failed transcripts.

- - `cookies` (optional) - Path to [cookies.txt](#cookies) file.
-
All methods return a map which contains the video ID as a key and the corresponding result as a value.

```java
@@ -426,10 +428,28 @@ undocumented API URL embedded within its HTML. This JSON looks like this:
}
```

- This library works by making a single GET request to the YouTube page of the specified video, extracting the JSON data
- from the HTML, and parsing it to obtain a list of all available transcripts. To fetch the transcript content, it then
- sends a GET request to the API URL extracted from the JSON. The YouTube API returns the transcript content in XML
- format, like this:
+ Before you could directly extract this JSON from video page HTML and call extracted API URL, but YouTube fixed this by
+ not allowing
+ requests to the URL that is embedded in this JSON,
+ but there is a workaround. Each video page also contains an INNERTUBE_API_KEY field, which can be used to access
+ internal YouTube API. Because of this you can make POST request to this URL
+ `https://www.youtube.com/youtubei/v1/player?key=INNERTUBE_API_KEY` with a body like this:
+
+ ```json
+ {
+   "context": {
+     "client": {
+       "clientName": "ANDROID",
+       "clientVersion": "20.10.38"
+     }
+   },
+   "videoId": "dQw4w9WgXcQ"
+ }
+ ```
+
+ To retrieve JSON that is similar to the JSON contained in the video page HTML. Extracted API URL is then
+ called to retrieve the content of the transcript,
+ it has an XML format and looks like this

```xml
<?xml version="1.0" encoding="utf-8"?>