You will be creating a bash script that scrapes the URLs of tweets from your Twitter Bookmarks and saves them in a markdown file.
Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites. In general, web data extraction is used by people and businesses who want to make use of the vast amount of publicly available web data to make smarter decisions.
In this project we will scrape Twitter, collecting the URLs of the tweets on the Bookmarks page, using simple command-line tools such as cURL and jq.
[Disclaimer: Web scraping should be used for learning purposes only. Any other use of the scraped data might result in legal action, or your IP might get blocked.]
The project consists of the following stages:
First we need to start a browser session so that we can send commands to it and work with the output it returns. To do this, download ChromeDriver and run the ChromeDriver server as a background process, so that the shell stays free for further commands.
- Use the cURL command and save the sessionID in a variable, as it will be needed in every command you use.
- Use the jq command for parsing the JSON.

You would be able to see the Chrome browser pop up on your screen (if headless is not passed as an argument when creating the browser session), with the terminal waiting for the next command.
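A minimal sketch of this stage, assuming ChromeDriver has been downloaded into the current directory and listens on its default port 9515 (both are assumptions; adjust to your setup):

```bash
#!/bin/bash
# Start the ChromeDriver server as a background process on its default port.
# "./chromedriver" assumes the binary sits in the current directory.
./chromedriver --port=9515 &
sleep 2  # give the server a moment to start accepting requests

# Create a browser session using the W3C WebDriver protocol. Add "--headless"
# to the args array if you do not want the browser window to pop up.
RESPONSE=$(curl -s -X POST "http://localhost:9515/session" \
  -H "Content-Type: application/json" \
  -d '{"capabilities": {"alwaysMatch": {"goog:chromeOptions": {"args": []}}}}')

# Parse the session ID out of the JSON response with jq and keep it in a
# variable, since every later command needs it.
SESSION_ID=$(echo "$RESPONSE" | jq -r '.value.sessionId')
echo "Started session: $SESSION_ID"
```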
Before we go to the Bookmarks section of Twitter and start scraping, we first need to sign in to our Twitter account. To do so, go to https://twitter.com/login, select the input fields, enter the appropriate values into those fields, and then locate and click the submit button.
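A minimal sketch of the sign-in steps using the W3C WebDriver endpoints; the CSS selector and the placeholder username are assumptions, since Twitter's login markup changes over time:

```bash
# Point the browser session at the login page.
curl -s -X POST "http://localhost:9515/session/$SESSION_ID/url" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://twitter.com/login"}' > /dev/null
sleep 3  # let the page render

# Locate the username field. The CSS selector below is an assumption;
# inspect the live page and substitute whatever the form actually uses.
FIELD=$(curl -s -X POST "http://localhost:9515/session/$SESSION_ID/element" \
  -H "Content-Type: application/json" \
  -d '{"using": "css selector", "value": "input[name=\"text\"]"}')

# W3C responses nest the element ID under this fixed, well-known key.
ELEMENT_ID=$(echo "$FIELD" | jq -r '.value["element-6066-11e4-a52e-4f735466cecf"]')

# Type the username into the field.
curl -s -X POST "http://localhost:9515/session/$SESSION_ID/element/$ELEMENT_ID/value" \
  -H "Content-Type: application/json" \
  -d '{"text": "your_username_here"}' > /dev/null

# Repeat the locate/type steps for the password field, then locate the
# submit button the same way and click it with:
#   POST /session/$SESSION_ID/element/$BUTTON_ID/click  (empty JSON body {})
```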
After completing the requirements, you would be signed in to your Twitter account and can move on to the next module, where we do the scraping.
In this unit we will scrape the URLs of the tweets present on https://twitter.com/i/bookmarks. First we need to navigate to this page and then smartly locate the elements where the URLs of the tweets are present.
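A minimal sketch of this unit; the CSS selector for tweet links is an assumption based on tweet permalinks containing "/status/", so verify it against the live page:

```bash
# Navigate to the Bookmarks page.
curl -s -X POST "http://localhost:9515/session/$SESSION_ID/url" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://twitter.com/i/bookmarks"}' > /dev/null
sleep 3  # wait for the tweets to load

# Find every link that looks like a tweet permalink.
ELEMENTS=$(curl -s -X POST "http://localhost:9515/session/$SESSION_ID/elements" \
  -H "Content-Type: application/json" \
  -d '{"using": "css selector", "value": "article a[href*=\"/status/\"]"}')

# Pull each element ID out of the response, then read its href attribute.
for EID in $(echo "$ELEMENTS" | jq -r '.value[] | .["element-6066-11e4-a52e-4f735466cecf"]'); do
  curl -s "http://localhost:9515/session/$SESSION_ID/element/$EID/attribute/href" \
    | jq -r '.value'
done
```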
After fulfilling all the requirements, you would be able to grab the URLs of the tweets currently visible on the screen. In the final module, you will collect the URLs of all the tweets and save them in a markdown file.
In this section you will use associative arrays, since when you scroll down you can encounter the same element two or three times, depending on how far you scroll. With the help of an associative array (which can serve as a set data structure in Bash) we can keep only the unique URLs. After that we will save them in a markdown file.
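A minimal sketch of the scroll-and-collect loop, assuming a hypothetical helper grab_urls that prints the tweet URLs currently on screen (one per line, as in the previous sketch) and an assumed output file named bookmarks.md:

```bash
# Associative arrays require bash 4+. Using URLs as keys turns the
# array into a set: inserting the same URL twice keeps a single entry.
declare -A SEEN

# grab_urls is a hypothetical helper that prints the tweet URLs
# currently visible on the page, one per line.
for _ in 1 2 3 4 5; do
  for URL in $(grab_urls); do
    SEEN["$URL"]=1
  done
  # Scroll to the bottom of the page so that more tweets load.
  curl -s -X POST "http://localhost:9515/session/$SESSION_ID/execute/sync" \
    -H "Content-Type: application/json" \
    -d '{"script": "window.scrollTo(0, document.body.scrollHeight)", "args": []}' > /dev/null
  sleep 2
done

# Write the unique URLs out as a markdown bullet list.
for URL in "${!SEEN[@]}"; do
  echo "- $URL"
done > bookmarks.md
```

Iterating over "${!SEEN[@]}" walks the array's keys, which is exactly the deduplicated set of URLs.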
(To keep only unique URLs, you can also make use of the uniq command of bash, as sketched below.)

After fulfilling all the requirements, you would be able to save the URLs of all the tweets in a markdown file.
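A minimal sketch of the uniq alternative, assuming the collected URLs have been appended, one per line, to a hypothetical file named all_urls.txt:

```bash
# uniq only removes adjacent duplicates, so sort the list first, then
# prefix each surviving line with "- " to turn it into a markdown bullet.
sort all_urls.txt | uniq | sed 's/^/- /' > bookmarks.md
```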