An important aspect of data engineering is leveraging drives (could based repository of virtually anything that can be stored) whether it is to keep a log of changes that occur, create files, update files, create folder… Although it is something that is widely used today for data engineers and web-developers, data scientists and data analysts are most of the times out of shape on those subjects. This post is meant to give our secret recipe to connect with four of the main drives in the market using Python and its libraries:
SharePoint
Dropbox
Google Drive
Citrix ShareFile
Prerequisites
You will need to have python installed on your machine. The rest should flow naturally in the post. We are using the following version of python:
Python 3.7.2 [MSC v.1916 64 bit (AMD64)] on win32
SharePoint
SharePoint is the drive from Microsoft. Many of us have given up on following the numerous changes undertaken by Microsoft for this product. We will focus on performing our work on SharePoint online. We have a SharePoint included with our office package; hence we were able to leverage this. Usually, the name of the website follows this pattern:
www.tenant.sharepoint.com/sites/group.
Tenant: it is the name of your company under Microsoft
Group: it is the name you gave to the group you created in the SharePoint website.
SharePoint - Python libraries
In order to install libraries, we are using the pip install functionally that can be accessed using command prompt (for Windows) or terminal (for Mac). We will need to install two libraries: pandas and sharepy.
Once on your python script, you can start by calling the needed libraries:
SharePoint - Connection
In order to connect to the SharePoint group, we will use the module “connect” of the sharepy library you installed and write the following line of code:
SharePoint - Get list of all the documents
This script is meant to read in keep in memory the list of the documents that are currently in the SharePoint group. In order to make it easy, we will write a function that will return a list of those documents.
In the example we are taking, connection is the sharepy.session.SharePointSession object you created in the previous step. Then, site is the entire url to get to your group (for example: 'www.tenant.sharepoint.com/sites/groupname')
Finally the library is the name of the library you wish to scan. You might want to use library = 'Documents' if you have not changed the structure of the SharePoint.
SharePoint - Download and read excel file
This script is meant to download any excel file from the SharePoint drive and read it as a pandas data frame. In this example, we assume that we want to read a file in the roots of the Documents folder (not in subfolders).
connection and site remain the same as what has been used before. The file_name must match your excel file. This will download the excel file in folder where this script is run from, then read it as a pandas data frame.
SharePoint - Create a folder
This script is meant to create a folder anywhere in the existing arborescence. You will note that this function does not return anything. Since we wish to create a folder and for this simple example, we did not deem it necessary to return anything from this function.
connection and site remain the same as what has been used before.
parent_path: It can be the following, assuming that you have already a folder called “Folder1”: 'Shared Documents/Folder1/'
folder_name: pick any name that you wish, for instance 'new folder'
SharePoint - Upload an excel file
This script is meant to upload an excel file that is in the same folder as the script you are running.
connection and site remain the same as what has been used before.
file_name: use the name of you excel file without the extension.
parent_path: It can be the following, if you have already a folder called “Folder1”: 'Shared Documents/Folder1/'
Dropbox
Dropbox is a pure player in the market of cloud repository. It is interesting because of its freemium structure and a strong security architecture.
Dropbox - Python libraries
In order to install libraries, we are using the pip install functionally that can be accessed using command prompt (for Windows) or terminal (for Mac). We will need to install two libraries: pandas and dropbox.
Once on your python script, you can start by calling the needed libraries:
Dropbox - Connection
The first step is to get your access_token. Please check this page to do so: https://www.dropbox.com/developers/documentation/python#tutorial. In order to connect to the Dropbox folder, we will use the module “Dropbox” of the dropbox library you installed and write the following line of code:
Dropbox - Get list of all the documents
This script is meant to read in keep in memory the list of the documents that are currently in the Dropbox folder. In order to make it easy, we will write a function that will return a list of those documents.
In the example we are taking, dbx is the object you created in the previous step.
Finally the path is the name of the folder you wish to scan. For instance, you could use the following path = '/folder1'.
Dropbox - Read excel file
This script is meant to read any excel file from the Dropbox drive as a pandas data frame.
In the example we are taking, dbx is the object you created in the first step.
parent_path is the name of the folder where the excel file is located. For instance, you could use the following parent_path = '/folder1/'.
file is the name of the excel file, for instance 'dropboxlist'.
sheet_name is the name of the sheet you wish to read in memory from the excel file selected.
Dropbox - Upload an excel file
This script is meant to upload an excel file that is in the same folder as the script you are running.
In the example we are taking, dbx is the object you created in the first step.
parent_path is the name of the folder where you wish to send the file. For instance, you could use the following parent_path = '/folder1/'.
file is the name of the excel file (we assume here that it is located in the same folder as the script you are running), for instance 'dropboxlist'.
sheet_name is the name of the sheet you wish to read in memory from the excel file selected.
Google Drive
Google Drive is the drive from Google (big news right?). This drive is a classic in the market since it is included any time a "gmail" address is created. Nonetheless, a professional use is less spread because it is often seen as a non-secure drive (which I do not agree with). It offers very solid capabilities such as double-sign-on (with email or with text), encryption, folder/file access management... It is mostly free (unless you need more than 30G, which is an important aspect to consider.
Google Drive - Python libraries
In order to install libraries, we are using the pip install functionally that can be accessed using command prompt (for Windows) or terminal (for Mac). We will need to install two libraries: pandas, googleapiclient and oauth2client.
Once on your python script, you can start by calling the needed libraries:
Google Drive - Connection
The first step is to create a service account that you can give access to. A service account can be seen as a other entity that will be allowed to have access to some of your data. Just like you would give access to another employee or a friend of yours. If you do not share a file or a folder to this new entity, then they would not have access to it. The video that follows shows you how to create a service account (here mode VIEWER) and retrieve the security json key. Please note that for the examples we have below, a mode EDITOR should be used.
You can then save the json key to the folder where your script runs. To create the connection and to activate the needed scopes, we run the following function:
Google Drive - Get list of all the documents
This script is meant to read in keep in memory the list of the documents that are currently in the Google Drive folder. In order to make it easy, we will write a function that will return a list of those documents.
service comes from the connection script we have described above.
Google Drive - Read excel file
This script is meant to read any excel file from the Google Drive as a pandas data frame.
sheets comes from the connection script we have described above.
spreadsheet_id is the ID that google assigns to each file.
range_name is the range you wish to read. For instance, you could set range_name = 'A1:Z1000'.
Google Drive - Create a folder
This script is meant to create a folder anywhere in the Google Drive.
service comes from the connection script we have described above.
folder_name is the name you wish to assign the folder.
folder_id is the ID that google assigns to each file. In this case, you want to use the ID of the parent folder where you want the newly created folder to be located.
Google Drive - Upload an excel file
This script is meant to upload an excel file anywhere in the Google Drive.
service comes from the connection script we have described above.
file_name is the name you wish to assign the file.
parent_folder_id is the ID that google assigns to each file. In this case, you want to use the ID of the parent folder where you want the newly created folder to be located.
df is the dataframe you wish to upload to Google (for an excel file, you can use the following code: df = pd.read_excel('excel_file.xlsx')).
Citrix ShareFile
Amongst the top cloud drive in the market, this drive remains the least known. It is mostly targeting companies.
Citrix ShareFile - Python libraries
In order to install libraries, we are using the pip install functionally that can be accessed using command prompt (for Windows) or terminal (for Mac). We will need to install use libraries. They can al be pip installed. Here is the list of the needed libraries: (some are already in the standard python library):
Citrix ShareFile - Connection
In order to connect to the REST API of this drive, we need first to generate a key. We need to login to Citrix on the following url: https://api.sharefile.com/rest/ and go to the key generator as follow:
The connection is made by getting the token and we will then use this token for the future request.
hostname: 'tenant.sharefile.com'
client_id: see step detailed above
client_secret: see step detailed above
username: email address used in Citrix ShareFile
password: password used in Citrix ShareFile
Citrix ShareFile - Get list of all the documents
This script is meant to read in keep in memory the list of the documents that are currently in the Citrix ShareFile shared folder. In order to make it easy, we will write a function that will return a list of those documents.
token is coming from the previously built function.
Citrix ShareFile - Download and read excel file
This script is meant to download any excel file from the Citrix ShareFile drive and read it as a pandas data frame.
token is coming from the previously built function.
item_id is the ID that citrix assigns to each file.
local_path is the location where the file will be saved.
Citrix ShareFile - Create a folder
This script is meant to create a folder anywhere in the existing arborescence. You will note that this function does not return anything. Since we wish to create a folder and for this simple example, we did not deem it necessary to return anything from this function.
token is coming from the previously built function.
parent_id is the ID that citrix assigns to each file. Here you want the id of the parent folder.
name is the name of the folder you are creating.
description is the description of the folder you are creating.
Citrix ShareFile - Upload an excel file
This script is meant to upload an excel file that is in the same folder as the script you are running.
token is coming from the previously built function.
parent_id is the ID that citrix assigns to each file. Here you want the id of the parent folder.
local_path is the location where the file is located.
Want more information?
Please contact us at ask@simply-bi.com
Comentarios