SecureDL: Securing Code Execution and Access Control for Distributed Data Analytics Platforms. (arXiv:2106.13123v2 [cs.CR] UPDATED)

Distributed data analytics platforms such as Apache Spark enable
cost-effective processing and storage. These platforms allow users to
distribute data to multiple nodes and enable arbitrary code execution over this
distributed data. However, such capabilities create new security and privacy
challenges. First, user-submitted code may contain malicious logic designed to
circumvent existing security checks. In addition, providing
fine-grained access control for different types of data (e.g., text and images)
may not be feasible across different data storage options. To address these
challenges, we provide a fine-grained access control framework tailored for
distributed data analytics platforms, which is protected against evasion
attacks with two distinct layers of defense. Access control is implemented by
injecting access control logic into a submitted data analysis job at runtime. The
proactive security layer utilizes state-of-the-art program analysis to detect
potentially malicious user code. The reactive security layer consists of binary
integrity checking, instrumentation-based runtime checks, and sandboxed
execution. To the best of our knowledge, this is the first work that provides
fine-grained attribute-based access control for distributed data analytics
platforms using code rewriting and static program analysis. Furthermore, we
evaluate the performance of our security system under different settings and
show that the performance overhead due to the added security is low.
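
The two ideas named above, injecting access control logic into a submitted job and statically screening user code, can be illustrated with a minimal sketch. This is not the paper's implementation: it uses plain Python rather than Spark, and the policy (`clearance` versus `sensitivity`), the helper names, and the list of forbidden modules are all assumptions made for illustration.

```python
import ast
from typing import Callable, Dict, Iterable, Set

# Hypothetical policy: a record is visible when the user's clearance
# attribute is at least the record's sensitivity attribute.
def is_authorized(user_attrs: Dict, record: Dict) -> bool:
    return user_attrs.get("clearance", 0) >= record.get("sensitivity", 0)

def inject_access_control(job: Callable, user_attrs: Dict) -> Callable:
    """Wrap a user job so it only receives records the policy allows."""
    def guarded_job(records: Iterable[Dict]):
        permitted = (r for r in records if is_authorized(user_attrs, r))
        return job(permitted)
    return guarded_job

# Hypothetical deny-list for the static (proactive) check.
FORBIDDEN_MODULES = {"os", "subprocess", "socket", "ctypes"}

def find_forbidden_imports(source: str) -> Set[str]:
    """Statically collect top-level module names that are on the deny-list."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]
        else:
            continue
        for name in names:
            top_level = name.split(".")[0]
            if top_level in FORBIDDEN_MODULES:
                found.add(top_level)
    return found

# Example: a user job that counts records, run with clearance level 1.
def count_records(records: Iterable[Dict]) -> int:
    return sum(1 for _ in records)

sample_records = [
    {"id": 1, "sensitivity": 0, "text": "public"},
    {"id": 2, "sensitivity": 2, "text": "restricted"},
]
guarded_count = inject_access_control(count_records, {"clearance": 1})
visible = guarded_count(sample_records)
```

Here the user's job never sees the restricted record, because filtering happens in the wrapper rather than in user code. A deny-list of imports is only a toy stand-in for real program analysis and is easy to evade (for example via `__import__` or `eval`), which is one reason a runtime layer such as sandboxing is still needed.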