Commit 3764ce7

data is now downloaded with wget
1 parent eb6603d commit 3764ce7

1 file changed

+70
-0
lines changed

tutorials/python-programming-3.qmd

Lines changed: 70 additions & 0 deletions
@@ -75,8 +75,23 @@ I will focus on the `read_csv()` function to demonstrate the general process. Th
 
 This chapter will use data from the Varieties of Democracy (VDEM) dataset. VDEM is an ongoing research project to measure the level of democracy in governments around the world and updated versions of the dataset are released on an ongoing basis. The research is led by a team of over 50 social scientists who coordinate the collection and analysis of expert assessments from over 3,200 historians and Country Experts (CEs). From these assessments, the VDEM project has created a remarkably complex array of indicators designed to align with five high-level facets of democracy: electoral, liberal, participatory, deliberative, and egalitarian. The dataset extends back to 1789 and is considered the gold standard of quantitative data about global democratic developments. You can find the full codebook online, and I strongly recommend that you download it and consult it as you work with this data. You can find the full dataset at (https://www.v-dem.net/en/data/data/v-dem-dataset-v11/) and the codebook here (https://www.v-dem.net/media/filer_public/e0/7f/e07f672b-b91e-4e98-b9a3-78f8cd4de696/v-dem_codebook_v8.pdf). The filtered and subsetted version we will use in this book is provided in the `data/vdem` directory of the online learning materials.
 
+> Download the Data
+>
+> We'll download the data using wget. Run the following command in your terminal:
+>
+> ```zsh
+> wget -O "input/V-Dem-CY-Full+Others-v10.csv" "https://www.dropbox.com/scl/fi/cbsu5do70vh6ud93gvhpo/V-Dem-CY-Full-Others-v10.csv?rlkey=twl5qn5lz4nxrd7s3nertdyws&st=ur9xx7vl&dl=1"
+> ```
+>
+> Now you can load the data from `input/` using the code block below.
+
 Let's load the CSV file into a Pandas `dataframe`.
 
+```{python}
+#| echo: false
+!wget -O "input/V-Dem-CY-Full+Others-v10.csv" "https://www.dropbox.com/scl/fi/cbsu5do70vh6ud93gvhpo/V-Dem-CY-Full-Others-v10.csv?rlkey=twl5qn5lz4nxrd7s3nertdyws&st=ur9xx7vl&dl=1"
+```
+
 ```{python}
 df = pd.read_csv('input/V-Dem-CY-Full+Others-v10.csv', low_memory=False)
 ```
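
The download step above writes into `input/`, and as far as I know `wget -O` does not create missing parent directories, so the command assumes that directory already exists. Here is a minimal sketch, using only the Python standard library, of a helper that creates the destination directory first and skips the download when the file is already present. The name `download_if_missing` is hypothetical and not part of the tutorial.

```python
from pathlib import Path
from urllib.request import urlretrieve


def download_if_missing(url: str, dest: str) -> Path:
    """Download url to dest unless dest already exists (hypothetical helper)."""
    path = Path(dest)
    # Create parent directories such as input/ or input/freedom_house/ if needed.
    path.parent.mkdir(parents=True, exist_ok=True)
    if not path.exists():
        # urlretrieve saves the response body to the given local path.
        urlretrieve(url, str(path))
    return path
```

To use it, pass the same URL and output path that appear in the `wget` command. Skipping files that already exist keeps repeated renders of the document from downloading the data again.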
@@ -410,6 +425,35 @@ The VDEM data contains an enormous amount of temporal data, but all at the level
 
 Unlike the VDEM data, the Russian Troll Tweets come as a collection of `csv` files. We will use a clever little trick to load up all the data in a single dataframe. The code block below iterates over each file in the `russian-troll-tweets/` subdirectory in the data directory. If the file extension is `csv`, it reads the `csv` into memory as a dataframe. All of the dataframes are then concatenated into a single dataframe containing data on ~3M tweets.
 
+> Download the Data
+>
+> Once again, we'll download the data using wget. Run the following command in your terminal:
+> ```zsh
+> wget -N -q -O input/russian_troll_tweets.zip "https://www.dropbox.com/scl/fo/a3uxioa2wd7k8x8nas0iy/AH5qjXAZvtFpZeIID0sZ1xA?rlkey=p1471igxmzgyu3lg2x93b3r1y&st=5g04qmt6&dl=1"
+> ```
+>
+> Next, unzip the file into the `input/` subdirectory.
+>
+> ```zsh
+> unzip -o input/russian_troll_tweets.zip -d input/russian-troll-tweets
+> rm input/russian_troll_tweets.zip
+> ```
+> Then you can load the data from `input/` using the code block below.
+
+```{python}
+#| echo: false
+!wget -N -q -O input/russian_troll_tweets.zip "https://www.dropbox.com/scl/fo/a3uxioa2wd7k8x8nas0iy/AH5qjXAZvtFpZeIID0sZ1xA?rlkey=p1471igxmzgyu3lg2x93b3r1y&st=5g04qmt6&dl=1"
+
+!unzip -o input/russian_troll_tweets.zip -d input/russian-troll-tweets
+```
+
+Let's clean up the zip file.
+
+```{python}
+!rm input/russian_troll_tweets.zip
+```
+
+
 ```{python}
 import os
 data_dir = os.listdir("input/russian-troll-tweets/")
@@ -423,6 +467,21 @@ tweets_df = pd.concat((pd.read_csv(
 tweets_df.info()
 ```
 
+
+> There are more than 2.9M tweets in this dataset, and 21 columns. Let's run this code with a 10% random sample. You can increase (or decrease) the sample size if you wish.
+>
+> ```{python}
+> sample_size = 0.1
+> tweets_df = tweets_df.sample(frac=sample_size)
+> ```
+>
+
+```{python}
+#| echo: false
+sample_size = 0.1
+tweets_df = tweets_df.sample(frac=sample_size)
+```
+
 As you can see, we have two datatypes in our dataframe: `object` and `int64`. Remember that Pandas uses `object` to refer to columns that contain strings, or which contain mixed types, such as strings and integers. In this case, they refer to strings.
 
 One further thing to note about this dataset: each row is a tweet from a specific account, but some of the variables describe attributes of the tweeting accounts, not of the tweet itself. For example, followers describes the number of followers that the account had at the time it sent the tweet. This makes sense, because tweets don't have followers, but accounts do. We need to keep this in mind when working with this dataset.
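
The load-everything-then-sample pattern described above can be sketched on its own. This is an illustration rather than the tutorial's code: it swaps `os.listdir` for `pathlib` globbing, the function names `load_csv_dir` and `sample_rows` are hypothetical, and the seed value is an arbitrary assumption. Passing `random_state` to `sample()` makes the draw repeatable, which the unseeded call in the diff is not.

```python
from pathlib import Path

import pandas as pd


def load_csv_dir(directory: str) -> pd.DataFrame:
    """Read every .csv file in directory and stack them into one dataframe."""
    # Only files ending in .csv are matched; sorting keeps the order stable.
    files = sorted(Path(directory).glob("*.csv"))
    frames = [pd.read_csv(f, low_memory=False) for f in files]
    # ignore_index=True renumbers rows 0..n-1 instead of repeating each file's index.
    return pd.concat(frames, ignore_index=True)


def sample_rows(df: pd.DataFrame, frac: float = 0.1, seed: int = 42) -> pd.DataFrame:
    """Draw a random sample; a fixed seed returns the same rows on every run."""
    return df.sample(frac=frac, random_state=seed)
```

With the directory used in this chapter, that would look like `tweets_df = sample_rows(load_csv_dir("input/russian-troll-tweets/"))`. A fixed seed is useful when results need to match across renders of the document.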
@@ -575,6 +634,17 @@ When we concatenate the two dataframes the number of columns stays the same but
 
 An alternative way to combine datasets is to **merge** them. If you want to create a dataframe that contains columns from multiple datasets but is aligned on rows according to some column (or set of columns), you probably want to use the `merge()` function. To illustrate this, we will work with data from two different sources. The first is the VDEM data we used in the first part of this chapter (`fsdf`). The second is a dataset from Freedom House on levels of internet freedom in 65 countries. More information is available at https://freedomhouse.org/countries/freedom-net/scores.
 
+> Download the data.
+>
+>
+>
+>
+
+```{python}
+#| echo: false
+!wget -O "input/freedom_house/internet_freedoms_2020.csv" "https://www.dropbox.com/scl/fi/ewdlcxvubzpko6hu32583/internet_freedoms_2020.csv?rlkey=yzc5sbgmpk3nnn2g1u0sg9146&st=uob3s1rt&dl=1"
+```
+
 ```{python}
 freedom_df = pd.read_csv("input/freedom_house/internet_freedoms_2020.csv")
 ```
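
To make the row alignment that `merge()` performs concrete, here is a small sketch on toy data. Everything in it is an assumption for illustration: the dataframes, the column names, and the numbers are invented and do not come from the VDEM or Freedom House files.

```python
import pandas as pd

# Toy stand-ins for the two sources; names and values are invented.
vdem_toy = pd.DataFrame({
    "country_name": ["Canada", "Estonia", "Iceland"],
    "toy_democracy_score": [0.1, 0.2, 0.3],
})
freedom_toy = pd.DataFrame({
    "country_name": ["Canada", "Estonia", "Kenya"],
    "toy_internet_score": [10, 20, 30],
})

# An inner merge keeps only the countries present in both dataframes
# and places their columns side by side on matching rows.
merged = pd.merge(vdem_toy, freedom_toy, on="country_name", how="inner")
print(merged)
```

Iceland and Kenya each appear in only one table, so the inner merge drops them. Passing `how="left"` or `how="outer"` instead would keep unmatched rows and fill the missing columns with `NaN`.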
