Download ScienceDirect research papers as XML files in python

Shivani Jadhav
3 min readJan 24, 2021

Elsevier portal

Step 1: Log in to ScienceDirect

Step 2: Go to this link

Step 3: Select the first option “I want an API Key”

Step4: Click on Create API Key

Step5: Fill in some label eg: “NLP” and any random URL http://inf.com

Step 6: Click on agree and then submit

Below the registered API keys you will get the API key. Save the API key in a variable

key="your obtained api key"

All the necessary imports:

import urllib.requestimport shutilimport os

The next task is to get the URLs of the articles which have open access.

Steps to follow:

  1. Go to the following link: https://www.sciencedirect.com/search?accessTypes=openaccess
  2. Choose the filter that you want to apply to get the expected citations and select those documents.
  3. Click on Export and select Export citations to text. You cannot see the “Export” option if you are not signed in.
  4. We get a list of citations and we need to extract the URLs. The following piece of code does that.
def extractURLs(fileContent):     urls = re.findall('\(http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|     [!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', fileContent.lower())     return urlsfinalurl=[]myFile = open(text file that contains the exported citations,encoding="utf8")fileContent = myFile.read()finalurl=(extractURLs(fileContent))

5. Each value in finalurl consists of the link to a research paper. One sample URL in finalurl is: ‘(http://www.sciencedirect.com/science/article/pii/s000368701300054)’ Each file is identified by PII. The value after the last ‘/’ in the sample URL i.e s000368701300054 corresponds to the PII value of the research paper. We need to extract these PII values and pass them to the URL with which we could retrieve the article.

for url in finalurl:  name = url.rsplit('/', 1)[-1][0:-1]  response,headers =   urllib.request.urlretrieve('https://api.elsevier.com/content/article/pii/'+name+'?apiKey='+key)  filename = os.path.join(directory, response)  shutil.copy(filename, destiantion path of your choice+ name +".xml")

Congratulations, in the destination path you will find the XML files with the name, respective_pii.xml. For the sample URL, the corresponding XML file will be s000368701300054.xml

--

--

Shivani Jadhav

Data scientist Expertise : Information retrieval Area of interest-NLP, Machine learning, deep learning