scrape_linkedin is a Python package for scraping all details from public LinkedIn
profiles and turning the data into structured JSON. You can scrape both companies
and user profiles with this package.
Warning: LinkedIn has strong anti-scraping policies and may blacklist IPs that make unauthenticated or unusual requests.
Install directly with pip:
$ pip install git+git://github.com/austinoboyle/scrape-linkedin-selenium.git
Or clone the repository and install from source:
$ git clone https://github.com/austinoboyle/scrape-linkedin-selenium.git
$ python setup.py install
Tests are (so far) only run on static HTML files: one is a LinkedIn profile, the other is used to test some utility functions.
Because of LinkedIn's anti-scraping measures, you must make your Selenium browser look like an actual user. To do this, you need to add the li_at cookie to the Selenium session.
- Navigate to www.linkedin.com and log in
- Open browser developer tools (Ctrl-Shift-I or right click -> Inspect Element)
- Select the appropriate tab for your browser (Application on Chrome, Storage on Firefox)
- Click the Cookies dropdown on the left-hand menu and select the www.linkedin.com option
- Find and copy the li_at value
There are two ways to set your li_at cookie:
- Set the LI_AT environment variable
  $ export LI_AT=YOUR_LI_AT_VALUE
  On Windows:
  C:/foo/bar> set LI_AT=YOUR_LI_AT_VALUE
 - Pass the cookie as a parameter to the Scraper object.
>>> with ProfileScraper(cookie='YOUR_LI_AT_VALUE') as scraper: 
A cookie value passed directly to the Scraper will override your environment variable if both are set.
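For illustration, here is a minimal sketch combining the two approaches (the cookie value is a placeholder; reading LI_AT manually like this is just one way to wire it up):
import os
from scrape_linkedin import ProfileScraper

# Use the LI_AT environment variable if set, otherwise a placeholder you must replace.
li_at = os.environ.get('LI_AT', 'YOUR_LI_AT_VALUE')

# Passing the cookie explicitly overrides the LI_AT environment variable.
with ProfileScraper(cookie=li_at) as scraper:
    profile = scraper.scrape(user='austinoboyle')
print(profile.to_dict())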
See /examples
scrape_linkedin comes with a command line interface, scrapeli, created using click.
Note: the CLI only works with personal profiles for now.
Options:
- --url: full URL of the profile you want to scrape
- --user: www.linkedin.com/in/USER
- --driver: browser type to use (Chrome/Firefox), default: Chrome
- -a --attribute: return only a specific attribute (default: return all attributes)
- -i --input_file: raw path to an HTML file of the profile you want to scrape
- -o --output_file: raw path to the output file for the structured JSON profile (just prints results by default)
- -h --help: show this screen
 
Examples:
- Get Austin O'Boyle's profile info:
  $ scrapeli --user=austinoboyle
- Get only the skills of Austin O'Boyle:
  $ scrapeli --user=austinoboyle -a skills
- Parse a stored HTML profile and save JSON output:
  $ scrapeli -i /path/file.html -o output.json
Use the ProfileScraper component to scrape profiles.
from scrape_linkedin import ProfileScraper
with ProfileScraper() as scraper:
    profile = scraper.scrape(user='austinoboyle')
print(profile.to_dict())

Profile - the class that has properties to access all information pulled from a profile. It also has a to_dict() method that returns all of the data as a dict.
from scrape_linkedin import Profile

with open('profile.html', 'r') as profile_file:
    profile = Profile(profile_file.read())

print(profile.skills)
# [{...}, {...}, ...]
print(profile.experiences)
# {jobs: [...], volunteering: [...], ...}
print(profile.to_dict())
# {personal_info: {...}, experiences: {...}, ...}
Structure of the fields scraped
- personal_info
  - name
  - company
  - school
  - headline
  - followers
  - summary
  - websites
  - phone
  - connected
  - image
- skills
- experiences
  - volunteering
  - jobs
  - education
- interests
- accomplishments
  - publications
  - certifications
  - patents
  - courses
  - projects
  - honors
  - test scores
  - languages
  - organizations
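For illustration, here is a minimal sketch of pulling individual fields out of the scraped data. The top-level keys (personal_info, experiences, ...) match the to_dict() output shown above; the exact nesting of the inner keys is assumed from the field list.
# Hypothetical field access; key nesting assumed from the structure above.
data = profile.to_dict()
print(data['personal_info']['name'])       # the person's name
print(data['experiences']['jobs'])         # list of job entries
print(data['accomplishments']['courses'])  # list of courses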
Use the CompanyScraper component to scrape companies.
from scrape_linkedin import CompanyScraper
with CompanyScraper() as scraper:
    company = scraper.scrape(company='facebook')
print(company.to_dict())

Company - the class that has properties to access all information pulled from a company profile. There will be three properties: overview, jobs, and life. Overview is the only one currently implemented.
from scrape_linkedin import Company

# Company, like Profile, is assumed to take the raw HTML of each page.
with open('overview.html', 'r') as overview, \
        open('jobs.html', 'r') as jobs, \
        open('life.html', 'r') as life:
    company = Company(overview.read(), jobs.read(), life.read())

print(company.overview)
# {...}
Structure of the fields scraped
- overview
  - name
  - company_size
  - specialties
  - headquarters
  - founded
  - website
  - description
  - industry
  - num_employees
  - type
  - image
- jobs: NOT YET IMPLEMENTED
- life: NOT YET IMPLEMENTED
Pass these keyword arguments into the constructor of your Scraper to override default values. You may (for example) want to decrease/increase the timeout if your internet is very fast/slow.
- cookie {str}: li_at cookie value (overrides env variable) - default: None
- driver {selenium.webdriver}: driver type to use - default: selenium.webdriver.Chrome
- driver_options {dict}: kwargs to pass to driver constructor - default: {}
- scroll_pause {float}: time (s) to pause between scroll increments - default: 0.1
- scroll_increment {int}: num pixels to scroll down each time - default: 300
- timeout {float}: default time to wait for async content to load - default: 10
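For example, here is a sketch of a scraper tuned for a slow connection; the keyword names come from the list above and the specific values are only illustrative:
from scrape_linkedin import ProfileScraper

# Illustrative values: a longer timeout and gentler scrolling for a slow connection.
with ProfileScraper(timeout=30, scroll_pause=0.5, scroll_increment=200) as scraper:
    profile = scraper.scrape(user='austinoboyle')
print(profile.to_dict())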
New in version 0.2: built-in parallel scraping functionality. Note that the up-front cost of starting a browser session is high, so for this to be beneficial you will want to be scraping many (> 15) profiles.
from scrape_linkedin import scrape_in_parallel, CompanyScraper
companies = ['facebook', 'google', 'amazon', 'microsoft', ...]
#Scrape all companies, output to 'companies.json' file, use 4 browser instances
scrape_in_parallel(
    scraper_type=CompanyScraper,
    items=companies,
    output_file="companies.json",
    num_instances=4
)

Parameters:
- scraper_type {scrape_linkedin.Scraper}: scraper to use
- items {list}: list of items to be scraped
- output_file {str}: path to output file
- num_instances {int}: number of parallel instances of selenium to run
- temp_dir {str}: name of temporary directory to use to store data from intermediate steps - default: 'tmp_data'
- driver {selenium.webdriver}: driver to use for scraping - default: selenium.webdriver.Chrome
- driver_options {dict}: dict of keyword arguments to pass to the driver function - default: scrape_linkedin.utils.HEADLESS_OPTIONS
- **kwargs {any}: extra keyword arguments to pass to the scraper_type constructor for each job
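As a further sketch, the same helper can be pointed at profiles instead of companies, assuming ProfileScraper is accepted as a scraper_type and that extra keyword arguments such as timeout are forwarded to each instance as described above; the user list and output file name here are just placeholders:
from scrape_linkedin import scrape_in_parallel, ProfileScraper

users = ['austinoboyle', 'some-other-user']  # placeholder profile slugs

# Scrape all profiles with 2 browser instances; timeout is forwarded to each
# ProfileScraper via **kwargs (illustrative value).
scrape_in_parallel(
    scraper_type=ProfileScraper,
    items=users,
    output_file='profiles.json',
    num_instances=2,
    timeout=30
)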
Report bugs and feature requests here.