Friday, May 31, 2019

dampy – A Python tool to work with AEM DAM


Disclaimer

The author of this blog is also the creator of the dampy tool, so in a sense you can call this an exercise in self-marketing. But this tool has helped me handle a lot of frequent ad-hoc client requests very comfortably. In fact, it started as a set of one-off scripts written for individual requirements, and dampy is the consolidation of many such scripts into a comprehensive tool for working with AEM DAM. I will be glad if it is of use to someone out there.

About the tool

dampy is a tool to work with AEM DAM. For a client I recently worked with, the client team had frequent requests after go-live to:
  • Get a list of all the assets under a path
  • Download all assets under a path
  • Upload assets organized in the local folder structure to DAM
  • Extract specific properties of all assets under a path
  • Update properties of assets based on spreadsheet
  • And so on…

After dabbling with curl, AEM reports and WebDAV tools, I came to realize that writing Python scripts to make REST API calls to AEM and convert the resulting JSON into the required output format was the quickest and easiest way to handle these requests. dampy is the consolidation of many such scripts into a comprehensive tool to work with AEM DAM.


Getting Started

dampy is available as a pip install. To start working with the tool, install it through the pip command:

pip install dampy


After the tool is installed, you can start using it straight away. All it takes is to import the AEM class from the dampy package and start working with it. The code below is all it takes to get a list of all assets in DAM:


# Getting started
# Import AEM from dampy.AEM
>>> from dampy.AEM import AEM

# Create a handle to an AEM Instance
>>> author = AEM()

# List all assets in DAM
>>> author.dam.list()


As you can see, three lines is all it takes to get a list of all assets in DAM:
  1. Import AEM from dampy.AEM
  2. Create an instance of AEM
  3. Call the list method

Note: By default, dampy connects to AEM Author instance running locally with admin/admin as credentials. Keep your local AEM instance running when trying the above snippet.


dampy in-depth

The following sections explain in depth the functionalities and technical details of the dampy tool.


Creating an AEM handle

The first step in working with dampy is to create an AEM handle. To create one, import AEM from dampy.AEM and create an instance of it. The code snippet below shows the various options for creating an AEM handle:


>>> from dampy.AEM import AEM
# Different options for creating AEM handle
# Create AEM handle to default AEM instance
>>> aem = AEM()

# Create AEM handle to the given host and credentials
>>> aem = AEM('http://my-aem-host:port', 'user', 'password')

# Create AEM handle to default host, for admin user with password as 'new-password'
>>> aem = AEM(password='new-password')


As you can see, the AEM constructor takes three optional parameters:

host – defaults to http://localhost:4502
user – defaults to ‘admin’
password – defaults to ‘admin’

You can pass in none, some, or all of these parameters to create an AEM handle.

The dam object in AEM handle

The AEM handle wraps a dam object, and all dampy functionality is exposed as methods on this dam object. The signature for invoking any method on dam looks like this:

>>> aem.dam.<api>(<params...>)

We will see all the methods exposed on the dam object in the following sections.

list()

This method takes an optional path parameter and returns a list of all assets under that path. The returned list covers the entire node tree under the given path: assets in its subfolders, their subfolders, and so on.
If the path parameter is not provided, it returns all assets in DAM (all assets under /content/dam).
It also has two more optional parameters:

csv_dump – A Boolean flag indicating if the list needs to be written to a CSV file. Defaults to False

csv_file – The output CSV file name to write the list to. The list gets written to a file if either csv_dump is set to True or a csv_file value is specified.

If csv_dump is set to True but no csv_file value is specified, the output is written to the file 'output/asset_list.csv' under the current working directory.


# List all assets under '/content/dam/my_folder'
>>> my_assets = aem.dam.list('/content/dam/my_folder')

# List all assets under '/content/dam'
>>> all_assets = aem.dam.list()

# List all assets under '/content/dam/my_folder' and also write the output to a CSV file
>>> my_assets = aem.dam.list('/content/dam/my_folder', csv_dump=True)

# List all assets under '/content/dam/my_folder' and also write the output to the CSV file specified
>>> my_assets = aem.dam.list('/content/dam/my_folder', csv_file='output/list.csv')


createFolder()

This method creates a new folder in DAM. It takes the path of the folder to create and an optional title for the folder, and returns a Boolean value indicating the success status of the folder creation. Parameters:

path – DAM folder path to create
title – Optional title for the folder. If not provided, the name of the folder is set as the title


# Create a new DAM folder /content/dam/new_folder and set its title to 'new_folder'
>>> status = aem.dam.createFolder('/content/dam/new_folder')

# Create a new DAM folder /content/dam/new_folder and set its title to 'My New Folder'
>>> status = aem.dam.createFolder('/content/dam/new_folder', 'My New Folder')

uploadAsset()

This method uploads an asset from a local path to DAM under the path specified. It takes 2 parameters:

file – Path of the local file to upload. This is a mandatory parameter
path – DAM path under which the file has to be uploaded. Defaults to '/content/dam' if not specified

This method returns a Boolean value indicating the success status


# Upload the given file to the specified DAM folder
>>> status = aem.dam.uploadAsset('./assets/sample1.png', '/content/dam/new_folder')

# Upload the given file to DAM under '/content/dam'
>>> status = aem.dam.uploadAsset('./assets/sample1.png')

uploadFolder()

This method uploads all the assets from a local folder to DAM. It takes 2 parameters:

dir – The local folder path; all assets under it get uploaded to DAM. This is optional and defaults to the folder named 'upload' under the current path if not specified

path – DAM path under which to upload. It is optional and defaults to /content/dam. For assets whose local folder structure already starts with /content/dam/…, this parameter is ignored

The folder structure under the given local folder dir gets reflected under the DAM path provided.
This method also returns a Boolean value indicating the success status


# Upload all the folders and assets under ./upload to DAM under /content/dam
>>> status = aem.dam.uploadFolder()

# Upload all the folders and assets under ./upload to DAM under /content/dam/my_uploads
>>> status = aem.dam.uploadFolder(path='/content/dam/my_uploads')

# Upload all the folders and assets under ./assets to DAM under /content/dam
>>> status = aem.dam.uploadFolder('./assets')

# Upload all the folders and assets under ./assets to DAM under /content/dam/my_uploads
>>> status = aem.dam.uploadFolder(dir='./assets', path='/content/dam/my_uploads')



downloadAsset()

This method downloads the given asset to a local folder. It takes three parameters:
asset_path – A mandatory parameter which is the full path of the asset to download

dir – Local folder path to download the asset to. This is optional, and the asset gets downloaded to a folder named 'download' if not specified. The folder gets created if it does not exist on the file system

retain_dam_path – A Boolean flag to retain or ignore the DAM folder tree when downloading to local. Defaults to False

This method returns a Boolean value indicating the success or failure of the download


# Download the asset to ./download folder
>>> status = aem.dam.downloadAsset('/content/dam/my_folder/dampy_sample.png')

# Download the asset to ./assets folder
>>> status = aem.dam.downloadAsset('/content/dam/my_folder/dampy_sample.png', './assets')

# Download the asset to ./assets folder under the subfolder /content/dam/my_folder
>>> status = aem.dam.downloadAsset('/content/dam/my_folder/dampy_sample.png', './assets', True)


downloadFolder()

This method downloads all the assets under a given DAM path to a local folder. It takes three parameters:

path – Optional. Path of the DAM folder from which all assets get downloaded. If this parameter is not given, all assets under '/content/dam' get downloaded

dir – Local folder path to download all the assets to. This is optional, and the assets get downloaded to a folder named 'download' if not specified. The folder gets created if it does not already exist on the file system.

retain_dam_path – A Boolean flag to retain or ignore the DAM folder tree when downloading to local. Defaults to True

The tree structure of DAM is retained when downloading assets under the given DAM path, with the downloaded folder structure reflecting the DAM folder hierarchy

This method also returns a Boolean value indicating the success or failure of the download


# Download all assets under the given DAM folder to ./download, retaining the DAM folder structure in local
>>> status = aem.dam.downloadFolder('/content/dam/my_folder')

# Download all assets under the given DAM folder to ./assets folder, retaining the DAM folder structure in local
>>> status = aem.dam.downloadFolder('/content/dam/my_folder', './assets')

# Download all assets under the given DAM folder to ./assets folder. All assets are placed in ./assets folder, ignoring the DAM folder structure
>>> status = aem.dam.downloadFolder('/content/dam/my_folder', './assets', False)



metadata()

This method returns the metadata of the given asset as a JSON object. It takes 2 parameters:

asset_path – A mandatory parameter which is the full path of the asset
level – Optional. The nesting level of the node hierarchy to include in the response JSON. Defaults to 1


# Get level 1 metadata of the given asset
>>> metadata_json = aem.dam.metadata('/content/dam/my_folder/dampy_sample.png')

# Get up to level 4 metadata of the given asset
>>> metadata_json = aem.dam.metadata('/content/dam/my_folder/dampy_sample.png', 4)


xprops()

This method extracts the metadata properties of all the assets under a given path and writes them to a CSV file. It takes 3 parameters, all optional:

path – DAM path. Extracts properties of all assets under this path. Defaults to '/content/dam'

props – List of properties to extract. By default, extracts the asset path and title

csv_file – The output file to write the extracted properties to. By default, they are written to the file 'output/asset_props.csv'


# Extract path and title of all dam assets and write them to output/asset_props.csv
>>> status = aem.dam.xprops()

# Extract path and title of all dam assets under my_folder and write them to output/asset_props.csv
>>> status = aem.dam.xprops('/content/dam/my_folder')

# Extract path, title and tags of all dam assets under my_folder and write them to output/asset_title_n_tags_.csv
>>> status = aem.dam.xprops('/content/dam/my_folder', ['jcr:path', 'jcr:content/metadata/dc:title', 'jcr:content/metadata/cq:tags'], 'output/asset_title_n_tags_.csv')


uprops()

This method takes a CSV file as input and updates asset properties with the data provided in it. It takes 1 parameter, the path to the CSV file:

csv_file – Path to the CSV file with the data for the asset properties update. By default, reads the input CSV file at 'input/asset_props.csv'

The input CSV file should adhere to the following conditions
  1. The first row is the header and should have the property name to update for the respective columns. Property name is the fully qualified name of the property under the asset. E.g. Title property name is ‘jcr:content/metadata/dc:title’
  2. The second row is the type of the property. Can be String, Date, Boolean, … and can be a single value or array value. E.g. for String array mention the type as ‘String[]’
  3. From row 3 onwards, each row contains the properties for one asset
  4. The first column must be ‘jcr:path’ property with its type as String and values as full path of the asset in DAM 
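The rules above can be sketched in code. The snippet below, which does not use dampy itself, writes a minimal conforming input file with Python's csv module; the asset path, title and tag values are made up for illustration.

```python
import csv

# Row 1: fully qualified property names (first column must be jcr:path).
# Row 2: property types (arrays use the [] suffix, e.g. String[]).
# Row 3 onwards: one row of property values per asset.
rows = [
    ["jcr:path", "jcr:content/metadata/dc:title", "jcr:content/metadata/cq:tags"],
    ["String", "String", "String[]"],
    ["/content/dam/my_folder/dampy_sample.png", "Sample Image", "my_app:samples"],
]

with open("asset_props.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)
```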

After creating the CSV file and placing it at a path, invoke the uprops method as shown in the code snippet below:


# Update properties based on the input csv file input/asset_props.csv
>>> status = aem.dam.uprops()

# Update properties based on the input csv file input/asset_cust_props.csv
>>> status = aem.dam.uprops('input/asset_cust_props.csv')


activate()

This method activates the given asset or a folder in DAM. It takes in one mandatory path parameter

path – Mandatory parameter specifying the path to the asset or DAM folder that needs to be activated

This method returns a Boolean value indicating the success status


# Activates the given asset in DAM
>>> status = aem.dam.activate('/content/dam/my_folder/dampy_sample.png')

# Activates the given folder (folder tree) in DAM
>>> status = aem.dam.activate('/content/dam/my_folder')


deactivate()

This method deactivates a given asset or folder in DAM. It takes one mandatory path parameter

path – Mandatory parameter specifying the path to the asset or a DAM folder that needs to be deactivated

This method returns a Boolean value indicating the success status


# Deactivates the given asset in DAM
>>> status = aem.dam.deactivate('/content/dam/my_folder/dampy_sample.png')

# Deactivates the given folder in DAM
>>> status = aem.dam.deactivate('/content/dam/my_folder')



delete()

This method deletes a given asset or a folder. It takes in one mandatory path parameter

path – Mandatory parameter specifying the path to the asset or the DAM folder that needs to be deleted

This method returns a Boolean value indicating the success status


# Deletes the given asset from DAM
>>> status = aem.dam.delete('/content/dam/my_folder/dampy_sample.png')

# Deletes the given folder from DAM
>>> status = aem.dam.delete('/content/dam/my_folder')


Wednesday, May 22, 2019

How to configure Dispatcher Flush Agents on Publisher?


Normally the dispatcher flush agents are configured on the author under the section 'Agents on Publish' and activated, so that they get replicated to the publish instances where they take effect. This works when the same configuration is needed on all the publish instances, since any node that gets activated is replicated to every publisher for which a replication agent is configured.

For cases where the dispatcher flush agent configuration needed on each publish instance differs, we can use one of the approaches below.

Direct configuration on Publish instances

Make the configuration directly on the publish instance. This involves logging in to each publish instance with admin credentials and creating the required dispatcher flush agent configuration for that instance. Avoid this approach for higher environments; it can be useful in development and test environments to get the configuration done quickly.

Using CURL scripts to create dispatcher flush agents

Use CURL to create the dispatcher flush agents needed on each publish instance. This is the most widely used approach, and the CURL scripts can be maintained for recreating instances in case of server rebuilds.
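As a sketch of what such a script posts, the snippet below builds the Sling POST form data commonly used to create a flush agent on a publish instance. The dispatcher URI, host names and agent node name are placeholders; the property names follow the standard cq/replication agent component, so verify them against your AEM version before use.

```python
# Hypothetical helper: build the form data for creating a dispatcher
# flush agent under /etc/replication/agents.publish on a publisher.
def flush_agent_payload(transport_uri, title="Dispatcher Flush"):
    return {
        "jcr:primaryType": "cq:Page",
        "jcr:content/jcr:primaryType": "nt:unstructured",
        "jcr:content/sling:resourceType": "cq/replication/components/agent",
        "jcr:content/cq:template": "/libs/cq/replication/templates/agent",
        "jcr:content/jcr:title": title,
        "jcr:content/enabled": "true",
        "jcr:content/serializationType": "flush",   # marks it as a flush agent
        "jcr:content/transportUri": transport_uri,  # dispatcher invalidation URL
        "jcr:content/triggerReceive": "true",       # fire on chain replication
        "jcr:content/triggerSpecific": "true",      # no versioning of trigger
    }

payload = flush_agent_payload("http://my-dispatcher:80/dispatcher/invalidate.cache")

# Posting it needs a running publish instance, e.g. with the requests library:
# requests.post("http://my-publish:4503/etc/replication/agents.publish/flush",
#               data=payload, auth=("admin", "admin"))
```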

Using packages to install the configuration on publish instances

Configure all the dispatcher flush agents needed on Author. Create packages – one for each publish instance with flush agents needed for that instance. Install the created package on the corresponding publish instance.

Dispatcher Farms and Cache Invalidation


One thing to be aware of when using multiple farms configuration in the dispatcher is the anatomy of cache invalidation requests.

Cache invalidation requests are sent to the URL /dispatcher/invalidate.cache of the dispatcher. On receiving this request, the dispatcher checks the CQ-Handle HTTP header field, which contains the path of the resource to be invalidated, and performs the invalidation based on the configuration on the dispatcher.

When the dispatcher is configured with multiple farms, it is natural to expect the invalidation request to match the farm based on the resource being invalidated. But this does not happen.
The invalidation request matches the farm based on the URL of the invalidation request, which is
/dispatcher/invalidate.cache

The configuration of the matching farm identified for this URL is used for the invalidation behavior. Since the invalidation URL is the same for all resources in AEM, all invalidation requests, irrespective of the resource being invalidated, pick up the same matching farm.

This poses a challenge in having different invalidation configurations based on the resource being invalidated.
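The anatomy described above can be made concrete with a small sketch. The dispatcher host and content path below are placeholders; the point is that the URL is always /dispatcher/invalidate.cache, and only the CQ-Handle header carries the resource path, which is why farm matching ignores the resource.

```python
# Shape of a dispatcher cache invalidation request (placeholders throughout).
url = "http://my-dispatcher:80/dispatcher/invalidate.cache"
headers = {
    "CQ-Action": "Activate",                  # replication action
    "CQ-Handle": "/content/my_site/en/home",  # resource being invalidated
    "Content-Length": "0",
    "Content-Type": "application/octet-stream",
}

# Sending it requires a reachable dispatcher, e.g.:
# import requests
# requests.post(url, headers=headers)
```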

URL Rewrite using CQ-Handle


One simple solution is to use a URL rewrite on Apache to prepend the value of the CQ-Handle header field to the URL. This prefixes the URL with the path of the resource being flushed.

When this rewritten URL gets processed by the dispatcher, it picks up the matching farm based on the path of the resource being flushed and uses that farm's configuration for the invalidation.
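The rewrite described above can be sketched with mod_rewrite. This is an illustrative fragment, not a tested configuration; the conditions and flags may need adapting to your Apache setup.

```apache
# Sketch: prepend the CQ-Handle header value to the invalidation URL so
# that dispatcher farm matching can key off the resource path.
RewriteEngine On
RewriteCond %{REQUEST_URI} ^/dispatcher/invalidate\.cache$
RewriteCond %{HTTP:CQ-Handle} ^(/.+)$
RewriteRule .* %1/dispatcher/invalidate.cache [PT,L]
```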

Configuring dispatcher flush agents on Publish – Design Challenges


Having the dispatcher flush agents on the publisher instead of on the author has its benefits, as detailed here. But configuring the dispatcher flush agents on the publisher has to be carefully thought through and made suitable for your environment.


The mapping between the dispatcher and publisher plays a crucial role in this design. The dispatcher-to-publisher relationship could be
  • One to one
  • One to many
The one-to-one configuration is straightforward. All that needs to be done in this case is to configure the flush agent for the dispatcher on the one publish instance it is mapped to. This makes sure that for any resource that gets replicated to that publish instance, a flush request is sent to its mapped dispatcher and the resource gets flushed from the dispatcher cache.


But the one-to-many mapping configuration poses issues that need to be resolved as per the application requirements.


The key question to answer is: when one dispatcher is mapped to multiple publishers, how many and which publisher(s) should invalidate that dispatcher's cache? Can many publishers flush the cache of a single dispatcher? Would this configuration lead to some form of race condition due to the asynchronous nature of replication and cache invalidation? These questions need to be carefully analyzed based on the application scenarios.

Typically, the single-dispatcher-to-multiple-publishers mapping falls into one of the two categories below
  • Aggregation configuration, where a single dispatcher aggregates the request flow to two or more publishers
  • Composition configuration, where each dispatcher in the environment is connected to all publishers in the environment


Aggregation Configuration

In this configuration, dispatcher acts as an aggregator for two or more publish instances. A simple form of aggregation configuration is depicted in the diagram below

[Diagram: aggregation configuration – dispatcher D1 fronting publishers P1 and P2, dispatcher D2 fronting publishers P3 and P4]
In this configuration, an optimal solution would be to configure the flush agents for a dispatcher on all the publish instances that the dispatcher aggregates. For the configuration above, configure flush agents for D1 on both P1 and P2, and flush agents for D2 on both P3 and P4.

This results in duplicate flushing of cached content on the dispatcher but avoids the race condition.

The worst-case scenario that could occur in this configuration on activation of a resource R1 would be the following sequence:

  1. Replication requests for R1 get placed for P1 and P2
  2. R1 gets replicated to P1; replication to P2 gets delayed
  3. P1 flushes R1 from D1
  4. A user requests R1 from D1
  5. D1 does not have R1 in cache and goes to P2 to fetch R1
  6. P2 serves the older version of R1, as replication has not happened on P2 yet
  7. D1 caches the older version of R1 again and serves it as the response to the user request
  8. Now the replication of R1 to P2 happens
  9. At this stage P2 flushes R1 from D1 – the older version that got cached gets flushed
  10. A subsequent user request for R1 will now cache the new version of R1, as both P1 and P2 have the newer version at this stage

Though this configuration results in duplicate cache flushing from multiple publishers onto a single dispatcher, it makes sure that stale content does not live long in the dispatcher cache due to race conditions.

Composition configuration

In this configuration, dispatchers and publishers are mapped in a many-to-many fashion. It has the advantages that a dispatcher can load balance across multiple publishers and the same publisher can render for multiple dispatchers, thus providing maximum fault tolerance.

A simple form of this configuration is depicted below

[Diagram: composition configuration – each dispatcher connected to every publisher]
In this configuration, each dispatcher is connected to all the publishers. Another simple and very common setup, depicted below, in which the dispatchers are connected to an external load balancer that distributes requests across the publish instances, results in the same composition mapping scenario.

[Diagram: dispatchers connected to an external load balancer that distributes requests across the publish instances]
While it provides maximum fault tolerance, this configuration is not optimal for handling dispatcher cache flushing, especially for applications with frequently changing content.

Options that can be considered for the dispatcher cache flush configuration are
  • All publishers flush the cache of all dispatchers – causes network overhead and too many redundant cache flushes
  • A minimal subset of publishers act as flushing publishers for each dispatcher – this reduces the race condition, though it does not completely avoid it
  • One flushing publisher for each dispatcher – this could lead to a race condition
  • Flushing of the dispatcher cache done from the author – simple to configure and maintain, but could lead to a race condition
  • Purge the dispatcher cache periodically through an external mechanism (say, curl fired periodically through cron)
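The periodic-purge option in the last bullet can be sketched as a cron entry. The dispatcher host, schedule and purged path below are placeholders; adapt them to your environment.

```shell
# Illustrative crontab entry: every hour, send an invalidation request
# for /content so the whole cached tree is treated as stale.
0 * * * * curl -s -H "CQ-Handle: /content" -H "CQ-Action: Activate" \
  -H "Content-Length: 0" -H "Content-Type: application/octet-stream" \
  http://my-dispatcher:80/dispatcher/invalidate.cache
```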


As a best practice, avoid the composition mapping configuration. Consider the dispatcher and publisher combined (through a one-to-one mapping or aggregation configuration) as a single unit for scaling publish-side capacity.


In cases where the composition configuration is unavoidable, supplement it with periodic purging of the dispatcher cache through an external mechanism, so that stale content cached in rare cases does not live long on the dispatcher.





Dispatcher Cache Invalidation – A Race condition to be aware of


The default option for configuring dispatcher cache flushing is to configure the flush agents on the author, similar to the replication agent configuration, with one flush agent configured for each dispatcher in the environment.

When content gets activated, a request gets queued for each configured dispatcher flush agent, and cache flush requests are sent to the dispatchers asynchronously based on the queue items.

This is a simple configuration to set up and maintain, with all the configuration managed on the author. But be aware that this way of dispatcher cache flushing might lead to a race condition.

This is because both the replication and the dispatcher cache flushing happen asynchronously, and the order of completion of the two events is not guaranteed.

Consider a simple scenario with one author, one publisher and one dispatcher. When a page is activated, requests for replication to the publisher and for flushing the dispatcher cache get placed and are processed asynchronously.

Now, with the order of processing of these two events not guaranteed, one of the following two possibilities results
  • Replication to the publisher happens first followed by the flushing of dispatcher cache
  • Flushing of dispatcher cache happens first followed by the replication to the publisher

The first scenario is the desired behavior, but the second could result in a race condition if, between the flushing of the dispatcher cache and the replication to the publisher, a user request arrives for the resource being flushed.

This scenario is depicted in the diagram below, where a user request happens in between the dispatcher cache flush and the replication to the publisher.

[Diagram: user request arriving between the dispatcher cache flush and the replication to the publisher]
In this scenario the user request is forwarded to the publisher, as the content for the requested resource has already been flushed from the dispatcher. The publisher serves the older version of the content, as the replication of the modified content has not reached the publisher yet.

The dispatcher considers this version to be new content and uses it for all subsequent requests; it is not aware of the subsequent completion of replication on the publisher side.

For this reason, for any non-trivial application it is highly recommended to configure the dispatcher flush agents on the publisher side and trigger the flush action from the publisher through the chain replication mechanism.

The same scenario with dispatcher flush agents configured on the publish instance is depicted in the diagram below

[Diagram: the same scenario with the dispatcher flush agent configured on the publish instance]
In this case, requests from the user between steps 1 and 2 get the older version of the content from the cache. Step 2 flushes the dispatcher cache, and requests made after step 2 fetch the new content from the publisher and cache it for subsequent requests, thus eliminating the race condition.

Configuring the flush agents on publishers is not without its challenges. We explore the challenges of arriving at an optimal design for configuring dispatcher flush agents on publish instances in the next blog, here.
