NOTE 10/08/2020: I came up with a better way of doing this.
This post is on my blog. A Ghost blog.
This is my Instagram which I have been posting to more often recently as a result of my Tomrades challenge - my challenge to run Comrades 2019 in less than 7 hours and 30 minutes.
I have wanted for a while to synchronise my social media posts with my blog such that interested parties can see what I am up to across platforms. At the moment I am using Instagram more than Twitter for example. I am almost certainly missing stuff that people are posting on Twitter as a result.
Whilst I cannot help them, I can help people who are interested in following my life. If you want to find out what I am up to, what I am reading/doing/thinking/running, you will be able to find it all on thomasclowes.com even if you do not follow me on those individual platforms.
Challenge
To essentially copy any/all posts that I make on Instagram to this blog.
This broke down to a few requirements:
- Find an appropriate way to scrape my Instagram profile and/or otherwise pull details of my recent Instagram posts and their metadata.
- Upload the Instagram images to my own server.
- Find a way to create 'Ghost' posts.
Scraping
I stumbled upon this blog post which details someone's previous effort to do something similar to what I want to achieve with Python. He utilised the Instagram API to pull data from his Instagram. Unfortunately Instagram's API is way more restrictive nowadays due to data privacy laws and as such it is not an option for this. Of course, if it were, I would not have jumped straight to scraping.
I have historically built scrapers to scrape content from the Internet. Given how big Instagram is and how many resources they have at hand I made the assumption that they'd have the means to thwart my scraping efforts. A quick Google made it apparent that they regularly change their page formatting and so I decided to look for an alternative method.
I found websta which produces RSS feeds of Instagram profiles. This was ideal as there are already numerous tools for working with and parsing RSS feeds in multiple programming languages. Unfortunately I noticed that they were having an issue whereby all of their post timestamps were incorrect and as such I ended up using QueryFeed.
QueryFeed provides my Instagram data in an easily consumable format. They have done the heavy lifting for me and pulled out the appropriate data for me. If Instagram change their page structure QueryFeed will (hopefully) handle it. I just need to read the feed, copy the images, and post them to my blog.
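To illustrate the "read the feed" step, here is a minimal sketch that pulls the interesting fields out of an RSS 2.0 document. It assumes each post is an `<item>` carrying `<title>`, `<link>`, `<pubDate>` and an `<enclosure url="...">` for the image; I have not confirmed that this is exactly how QueryFeed lays out its feed, and the function names are my own. A regex keeps the sketch dependency free, but a proper RSS/XML parser would be the safer choice for a real script.

```javascript
// Return the text content of the first <tag> element in an XML fragment,
// unwrapping an optional CDATA section. Returns null if the tag is absent.
function textOf(xml, tag) {
  const match = xml.match(new RegExp(`<${tag}[^>]*>([\\s\\S]*?)</${tag}>`));
  if (!match) return null;
  return match[1].trim().replace(/^<!\[CDATA\[([\s\S]*?)\]\]>$/, '$1').trim();
}

// Turn an RSS 2.0 string into an array of plain post objects.
// The <enclosure> image location is an assumption about the feed layout.
function parseItems(rssXml) {
  const items = rssXml.match(/<item>[\s\S]*?<\/item>/g) || [];
  return items.map((item) => {
    const enclosure = item.match(/<enclosure[^>]*\burl="([^"]+)"/);
    const pubDate = textOf(item, 'pubDate');
    return {
      title: textOf(item, 'title'),
      link: textOf(item, 'link'),
      publishedAt: pubDate ? new Date(pubDate) : null,
      imageUrl: enclosure ? enclosure[1] : null,
    };
  });
}
```

Each returned object then carries everything the later steps need: an image URL to copy and the details to turn into a post.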
Copying the images
This was fairly simple. I opted to build the sync service using Node. I was able to upload the images to my own server by piping the response of a request for the image to a write stream using fs.
const fs = require('fs')
const request = require('request') // assuming the 'request' npm package
request(imageUrl)
  .pipe(fs.createWriteStream(fullFilename))
Duplicates
I wanted my script to run standalone, without dependencies on any other systems (databases etc.). The approach I opted for is primitive: I discern whether a post has been synced based on the presence of an image with the appropriate filename on my filesystem.
This makes the possibly incorrect assumption that if I successfully copy the image to my server I also succeed in posting it to my blog. Primitive.
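The filename check can be sketched roughly as follows. The `<postId>.jpg` naming scheme and the function names are made up for illustration; the only idea taken from the approach above is "if the image file exists, treat the post as synced".

```javascript
const fs = require('fs');
const path = require('path');

// Hypothetical naming scheme: one image per post, named after the post id.
function filenameFor(imageDir, postId) {
  return path.join(imageDir, `${postId}.jpg`);
}

// A post counts as synced if its image is already on disk.
function isAlreadySynced(imageDir, postId) {
  return fs.existsSync(filenameFor(imageDir, postId));
}

// Keep only the posts that still need copying and posting.
function unsyncedPosts(imageDir, posts) {
  return posts.filter((post) => !isAlreadySynced(imageDir, post.id));
}
```

One way to soften the assumption would be to write the image only once the blog post has been created successfully, so that a failed post is retried on the next run.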
Posting to Ghost
Ghost has an API. Unfortunately it is very basic, and not well documented. I cannot really complain however - Ghost is open source and their team make a very explicit point of the whole "If you don't like it make a pull request".
I discerned what to submit to the Ghost API by inspecting the network requests being made from the Ghost admin panel. One thing that was not immediately obvious to me was the fact that creating and updating posts seemingly submits complete model objects with every request. Looking at the source code I discerned that Ghost uses the bookshelf.js ORM, which I have no experience with.
Whilst 'playing' I stumbled upon various errors which I am confident are ORM/synchronicity related issues. Most of the issues pertained to submitting tags - I wanted to simply specify tag names and have the backend reconcile the relationships behind the scenes. For example I stumbled upon the (seemingly unresolved) error outlined in this issue pertaining to tag - post relationships when submitting multiple posts simultaneously.
Ideally one could just submit all the posts in one request anyway, but unfortunately whilst the API does take an array of post objects it only considers the first post in the aforementioned array.
Given that this is not a mission critical script keeping a company in business I just played and hacked. Submitting the posts synchronously one by one and specifying only the 'name' property of the tag objects worked, and it allows for a pretty, easy-to-follow output to the command line.
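The one-at-a-time submission can be sketched like this. The payload shape (a `posts` array holding a single post, with `title` and `html` fields) and the `submitPost` callback are assumptions for illustration rather than a documented Ghost contract; `submitPost` stands in for whatever function actually sends the HTTP request. The two points taken from the above are that each request carries exactly one post and that tags are given by name only.

```javascript
// Build the request body for a single post. Field names are assumptions;
// tags carry only a 'name' so the backend can reconcile the relationships.
function buildPayload(post) {
  return {
    posts: [
      {
        title: post.title,
        html: post.html,
        tags: post.tagNames.map((name) => ({ name })),
      },
    ],
  };
}

// Submit posts strictly one after another: each request is awaited before
// the next one starts, so no two submissions are ever in flight together.
async function submitSequentially(posts, submitPost) {
  const results = [];
  for (const post of posts) {
    const result = await submitPost(buildPayload(post));
    console.log(`Synced: ${post.title}`);
    results.push(result);
  }
  return results;
}
```

Awaiting inside a plain `for...of` loop is what keeps the requests from overlapping; mapping over the posts and using `Promise.all` would fire them simultaneously and risk the tag relationship errors described above.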
MySQL direct
I utilise Ghost with a MySQL backend. I did consider just connecting directly to MySQL and inserting the posts/tags etc. into the respective tables. I decided that this would require a more thorough investigation of the Ghost data model and would likely cause integrity issues due to my lack of knowledge.
Given that a proper API should be maintained for backwards compatibility I have comfort in the knowledge that as/when changes are made to Ghost and its API my script should still work.
Without further ado
So... this is what I came up with. It is not clean, nor is it concise, but it gets the job done. Have a play, improve it as you see fit, and enjoy syncing your Instagram posts with your Ghost blog.