
Databricks function for generating XML file

In a previous blog post I mentioned a Databricks (Scala) function for generating an XML file. Here it is.

import org.apache.spark.sql.DataFrame
import scala.io.Source
import java.io.File
import scala.reflect.io.Directory

def createXMLFile(
  parInputDataDataframe: DataFrame,  // dataframe with the main data; column names are used as element names
  parXmlDeclaration: String,         // XML declaration (version, encoding, etc.)
  parXmlHeader: String,              // XML header with the elements that come before the main data part
  parXmlMainElementStart: String,    // start of the root tag
  parXmlMainElementEnd: String,      // end of the root tag
  parDestinationFileName: String,    // XML file name only
  parDestinationFilePath: String,    // path to the XML file (path only, without file name), ending with "/"
  parXmlRootTag: String,             // root tag for the main data part
  parXmlRowTag: String               // tag for the rows of the main data
): Unit = {
  try {
    // mount the ADLS storage used for saving the XML document
    mountDestinationStorage("/mnt/export")

    // create a temporary XML file with the main data using the spark-xml library,
    // then read it back as plain text -- the XML library alone can't produce more complex XML files
    parInputDataDataframe.write
      .format("xml")
      .option("rootTag", parXmlRootTag)
      .option("rowTag", parXmlRowTag)
      .mode("overwrite")
      .save("dbfs:/mnt/export/" + parDestinationFilePath + parDestinationFileName)

    // read the generated XML file back as plain text
    val fileContents = Source.fromFile("/dbfs/mnt/export/" + parDestinationFilePath + parDestinationFileName + "/part-00000").mkString

    // compose the whole final XML document
    val txtFinal = parXmlDeclaration + parXmlMainElementStart + parXmlHeader + "\r\n" + fileContents + parXmlMainElementEnd

    // final save of the XML file (as plain text, without the XML library)
    dbutils.fs.put("dbfs:/mnt/export/" + parDestinationFilePath + parDestinationFileName + ".xml", txtFinal, true)

    // delete the temporary directory
    val directory = new Directory(new File("/dbfs/mnt/export/" + parDestinationFilePath + parDestinationFileName + "/"))
    directory.deleteRecursively()

    // unmount the ADLS storage
    unmountDestinationStorage("/mnt/export")
  }
  catch {
    case e: java.rmi.RemoteException =>
      println("Directory is already mounted")
      dbutils.fs.unmount("/mnt/export")
    case e: Exception =>
      println("There was some other error: " + e.getMessage)
  }
}
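
The helpers mountDestinationStorage and unmountDestinationStorage are not shown here. Purely as an illustration, a minimal sketch of what they might look like, assuming ADLS Gen2 accessed through a service principal; every <placeholder> value below is hypothetical and has to be replaced with your own:

// Hypothetical sketch -- account, container, client id/secret and tenant values are placeholders.
def mountDestinationStorage(parMountPoint: String): Unit = {
  val configs = Map(
    "fs.azure.account.auth.type" -> "OAuth",
    "fs.azure.account.oauth.provider.type" -> "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id" -> "<application-id>",
    "fs.azure.account.oauth2.client.secret" -> dbutils.secrets.get(scope = "<scope>", key = "<key>"),
    "fs.azure.account.oauth2.client.endpoint" -> "https://login.microsoftonline.com/<tenant-id>/oauth2/token"
  )

  // mount only if the mount point does not exist yet
  if (!dbutils.fs.mounts().exists(_.mountPoint == parMountPoint)) {
    dbutils.fs.mount(
      source = "abfss://<container>@<storage-account>.dfs.core.windows.net/",
      mountPoint = parMountPoint,
      extraConfigs = configs
    )
  }
}

def unmountDestinationStorage(parMountPoint: String): Unit = {
  dbutils.fs.unmount(parMountPoint)
}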

Difference between val and var (Azure Databricks)

(error: reassignment to val)

If you have already hit the same error message as me, the error sits between the chair and the keyboard, and it comes from not knowing the difference between val and var.

The difference between val and var is that val makes a variable immutable and var makes a variable mutable. Because val fields can’t vary, some people refer to them as values rather than variables.

Immutable: In object-oriented and functional programming, an immutable object (unchangeable object) is an object whose state cannot be modified after it is created.
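
A minimal example straight from a notebook cell shows the difference:

val a = 10
// a = 20   // uncommenting this line fails with: error: reassignment to val

var b = 10
b = 20      // fine -- var is mutable, b is now 20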


Create XML export from Azure SQL database

Task: Export data from an Azure SQL database to an XML file.

We have an Azure SQL database and we need to export data from it in the form of an XML file.

There are a couple of ways to create the XML file:

  • SQL Server: SQL Server can nicely save a query result as XML, but then saving the file to ADLS is difficult. I didn’t find an easy way to do it.
  • Azure Function: Undoubtedly the best way. A simple script for generating XML would suffice, but it would have cost me quite a lot of research and I didn’t want that :-).
  • ADF: Unfortunately, XML is not a supported format. JSON, for example, is supported.
  • Logic App: JSON can be converted to XML there, so we could combine ADF and a Logic App.
  • Databricks: Yes, in Scala there is a library that can convert a dataframe to XML and save it. But it is not almighty: it supports one root node and then only row nodes on a single level. However, the XML can be prepared in plain-text form and then saved as XML (via a plain-text save).

So the final solution is in Databricks. At least for now.

Azure Databricks
Although there is a library for working with XML files, if you need to create a more complex document, one that does not consist only of a root element and individual row elements, then some manual work has to be added.
The main idea is to prepare all the important parts of the XML separately: the declaration, the root-element tags, the header, the main data part itself, and the footer.

We’ll need a few libraries for all the fun. The main one is com.databricks:spark-xml_2.11:0.9.0, which you can find in Maven.
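
Besides spark-xml, the snippets below also rely on a few standard imports (Source for reading the generated part file, File and Directory for the clean-up at the end):

import scala.io.Source             // reading the generated part file as plain text
import java.io.File                // handle to the temporary directory
import scala.reflect.io.Directory  // recursive delete of the temporary directory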

Declaration
The XML declaration is a processing instruction that identifies the document as being XML.
Example: <?xml version="1.0" encoding="UTF-8" standalone="no" ?>

The XML declaration does not contain any variables, so we can use a plain string value. The same goes for the XML root tag, which is formed by the start tag parXmlMainElementStart and the end tag parXmlMainElementEnd.
parXmlRootTag is the root tag for the main data and parXmlRowTag is the tag for the rows in the main dataset.

val parXmlDeclaration: String = "<?xml version='1.0' encoding='UTF-8' standalone='yes'?>"
val parXmlMainElementStart: String = "<mainRoot>"
val parXmlMainElementEnd: String = "</mainRoot>"
val parXmlRootTag: String = "inquiries"
val parXmlRowTag: String = "inquiry"

The document header is also a standalone value.

val parXmlHeader =
  <header>
    <version>1.1</version>
    <firmNumber>NY01017T</firmNumber>
    <datetime>20200815</datetime>
  </header>

We now have the constant values ready. Next we prepare the row data itself. The data has to be loaded into a dataframe, for example straight from the Azure SQL database (see the sketch below); then all you have to do is call the write method.
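
A minimal sketch of loading the data over JDBC; the server, database, credentials and query below are placeholders you replace with your own:

// Hypothetical connection details -- replace with your own server, database and credentials.
val jdbcUrl = "jdbc:sqlserver://<server-name>.database.windows.net:1433;database=<database-name>"

val parDfInputData = spark.read
  .format("jdbc")
  .option("url", jdbcUrl)
  .option("query", "SELECT InquiryId, CustomerName, CreatedDate FROM dbo.Inquiry")  // hypothetical table and columns
  .option("user", "<sql-user>")
  .option("password", dbutils.secrets.get(scope = "<scope>", key = "<sql-password-key>"))
  .load()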

parDfInputData.write
  .format("xml")
  .option("rootTag", parXmlRootTag)
  .option("rowTag", parXmlRowTag)
  .mode("overwrite")
  .save("dbfs:/mnt/export/data")

This saved an XML document with the row data itself. A new directory was also created, containing the XML data in a file named part-00000. We only need the contents of this file, which we then glue together with the rest (you can check the directory contents as shown below).
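
To verify what was written, you can list the directory with the standard dbutils.fs.ls utility:

// list the files produced by the spark-xml writer (the part-00000 file holds the data)
dbutils.fs.ls("dbfs:/mnt/export/data").foreach(f => println(f.name + "  " + f.size))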
So we load the just-created XML document as an ordinary text file into a string value.

val fileContents = Source.fromFile("/dbfs/mnt/export/data/part-00000").mkString

And let’s put it all together.

val txtFinal = parXmlDeclaration + parXmlMainElementStart + parXmlHeader + "\r\n" + fileContents + parXmlMainElementEnd

Now we save the XML document as a plain text file.

dbutils.fs.put("dbfs:/mnt/export/data" + ".xml", txtFinal, true)

And finally we clean up the unwanted directory after ourselves.

val directory = new Directory(new File("/dbfs/mnt/export/data/"))
directory.deleteRecursively()

In the end, you can wrap all of this in a function for creating XML documents from the SQL database. But that will be in the next post :-).
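
That function is the one shown at the top of this page. A hedged example of how it could be called with the values prepared above; the file name and path are just illustrative:

// Example call of createXMLFile with the values from this post
// (the XML-literal header is converted to a String; file name and path are hypothetical).
createXMLFile(
  parInputDataDataframe  = parDfInputData,
  parXmlDeclaration      = parXmlDeclaration,
  parXmlHeader           = parXmlHeader.toString,
  parXmlMainElementStart = parXmlMainElementStart,
  parXmlMainElementEnd   = parXmlMainElementEnd,
  parDestinationFileName = "inquiries",
  parDestinationFilePath = "exports/",
  parXmlRootTag          = parXmlRootTag,
  parXmlRowTag           = parXmlRowTag
)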