Accepted Papers

We are pleased to announce the papers that have been accepted for presentation at ACM Multimedia 2019 as part of the main track. The list is grouped into clusters that reflect topics represented at the conference (a combination of areas and themes). Note that the order of the papers within each cluster carries no meaning, and titles might change slightly in the final version of a paper. We are currently mapping clusters to sessions and will announce the program soon. All papers will be presented as posters (A0 portrait). Additionally, selected papers will be invited to give oral presentations, and other papers will be invited to give flash presentations.

Engaging users with multimedia

Effective Sentiment-relevant Word Selection for Multi-modal Sentiment Analysis in Spoken Language
A Multimodal View into Music’s Effect on Human Neural, Physiological, and Emotional Experience
PDANet: Polarity-consistent Deep Attention Network for Fine-grained Visual Emotion Regression
Inferring Mood Instability via Smartphone Sensing: a Multi-View Learning Approach
Comp-GAN: Compositional Generative Adversarial Network in Synthesizing and Recognizing Facial Expression
Understanding the Teaching Styles by an Attention based Multi-task Cross-media Dimensional Modeling
Towards Increased Accessibility of Meme Images with the help of Rich Face Emotion Captions
Emotion Recognition Using Multimodal Residual LSTM Network
Multimodal Deep Denoise Framework for Affective Video Content Analysis
Predicting and Understanding News Social Popularity with Emotional Salience Features
Human-imperceptible Privacy Protection Against Machines
Identity- and Pose-Robust Facial Expression Recognition through Adversarial Feature Learning
Occluded Facial Expression Recognition Enhanced through Privileged Information
Explainable Interaction-driven User Modeling over Knowledge Graph for Sequential Recommendation
Focus Your Attention: A Bidirectional Focal Attention Network for Image-Text Matching
Collaborative Preference Embedding against Sparse Labels
Adversarial Preference Learning with Pairwise Comparisons
Learning Local Similarity with Spatial Relations for Object Retrieval
MMJN: Multi-Modal Joint Networks for 3D Shape Recognition
Dual-Level Embedding Alignment Network for 2D Image-Based 3D Object Retrieval
Learning Disentangled Representation for Cross-Modal Retrieval with Deep Mutual Information Estimation
Hierarchical Graph Semantic Pooling Network for Multi-modal Community Question Answer Matching
Domain-Specific Embedding Network for Zero-Shot Recognition
Finding Images by Dialoguing with Image
User Diverse Preference Modeling via Multimodal Attentive Metric Learning
MMGCN: Multimodal Graph Convolution Network for Personalized Recommendation of Micro-video
Personalized Hashtag Recommendation for Micro-videos
W2VV++: Fully Deep Learning for Ad-hoc Video Search
User-Aware Folk Popularity Rank: User-Popularity-Based Tag Recommendation that can Enhance Social Popularity
Routing Micro-videos via A Temporal Graph-guided Recommendation System
A Framework for Effective Known-Item Search in Video
Vision-Language Recommendation via Attribute Augmented Multimodal Reinforcement Learning
Separated Variational Hashing Networks for Cross-Modal Retrieval
Seeking Micro-influencers for Brand Promotion
MvsGCN: A Novel Graph Convolutional Network for Multi-video Summarization
Unsupervised Video Summarization with Attentive Conditional Generative Adversarial Networks
Multi-modal Knowledge-aware Event Memory Network for Social Media Rumor Detection
Who, Where, and What to Wear? Extracting Fashion Knowledge from Social Media
Informative Visual Storytelling with Cross-modal Rules
CRA-Net: Composed Relation Attention Network for Visual Question Answering
Stacked Memory Network for Video Summarization
Personalized Capsule Wardrobe Creation with Garment and User Modeling
Generating One Minute Summaries of Day Long Egocentric Videos

Multimedia experience

Quality Assessment of In-the-Wild Videos
Cross-Reference Stitching Quality Assessment for 360° Omnidirectional Images
360° Mulsemedia: A Way to Improve Subjective QoE in 360° Videos
ViProVoQ: Towards a Vocabulary for Video Quality Assessment in the Context of Creative Video Production
SGDNet: An End-to-End Saliency-Guided Deep Neural Network for No-Reference Image Quality Assessment
iDFusion: Globally Consistent Dense 3D Reconstruction from RGB-D and Inertial Measurements
DTDN: Dual-task De-raining Network
Generalized Playback Bar for Interactive Branched Video
FGLmser: A Flexible GAN-Lmser for Super-Resolution
Stereoscopic Visual Discomfort Prediction Using Multi-scale DCT Features
Facially Image-to-Video Translation by A Hidden Affine Transformation
GP-GAN: Towards Realistic High-Resolution Image Blending
Recognizing the Style of Visual Arts via Adaptive Cross-layer Correlation
Progressive Image Inpainting with Full-Resolution Residual Network
Melody Slot Machine: A Controllable Holographic Virtual Performer
Generating Captions for Images of Ancient Artworks
Super Resolution Using Dual Path Connections
Measuring the Innovation of Courseware in E-education Systems
GAIN: Gradient Augmented Inpainting Network for Irregular Holes
Scalable and Diverse cross-domain Image Translation
AnoPCN: Video Anomaly Detection via Deep Predictive Coding Network
Single Image Deraining via Recurrent Hierarchy Enhancement Network
Gradual Network for Single Image De-raining
STDGAN: ResBlock Based Generative Adversarial Nets Using Spectral Normalization and Two Different Discriminators
Lightweight Image Super-Resolution with Information Multi-distillation Network
Online Camera Pose Optimization for the Surround-view System
Monocular Depth Estimation as Regression of Classification using Piled Residual Networks
Audiovisual Zooming: What You See Is What You Hear
Adversarial Colorization of Icons based on Contour and Color Conditions
BasketballGAN: Generating Basketball Play Simulation through Sketching
Virtually Trying on New Clothing with Arbitrary Poses
Impact of Saliency and Gaze Features on Visual Control: Gaze-Saliency Interest Estimator
Fine-grained Fitting Experience Prediction: a 3D-slicing Attention Approach
AI Coach: Creating Personal Athletic Training Experiences by Human Pose Analysis in Videos
Editing Text in the Wild
LiveSense: Contextual Advertising in Live Streaming Videos
Towards Automatic Face-to-Face Translation
Towards a Perceptual Loss: Using a Neural Network Codec Approximation as a Loss for Generative Audio Models
Vision-based Price Suggestion for Online Second-hand Items

Multimedia systems

A Novel Two-stage Separable Deep Learning Framework for Practical Blind Watermarking
A Two-step Cross-modal Hashing by Exploiting Label Correlations and Preserving Similarity in Both Steps
Deep Hashing by Discriminating Hard Examples
Supervised Discrete Hashing With Mutual Linear Regression
Open Set Deep Learning with A Bayesian Nonparametric Generative Model
3D Point Cloud Geometry Compression on Deep Learning
Flexible Online Multi-modal Hashing for Large-scale Multimedia Retrieval
Eye in the Sky: Drone-Based Object Tracking and 3D Localization
Visual-Inertial State Estimation with Pre-integration Correction for Robust Mobile Augmented Reality
Themis: Efficient and Adaptive Resource Partitioning for Reducing Response Delay in Cloud Gaming
DeepQuantizedCS: Quantized Compressive Video Recovery using Deep Convolutional Networks
WealthAdapt: A General Network Adaptation Framework for Small Data Tasks
Real-Time Gesture Recognition Using 3D Sensory Data and a Light Convolutional Neural Network
Close the Gap between Deep Learning and Mobile Intelligence by Incorporating Training in the Loop
Monocular Visual Object 3D Localization in Road Scenes
PiTree: Practical Implementation of ABR Algorithms Using Decision Trees
Livesmart: a QoS-Guaranteed Cost-minimum Framework of Viewer Scheduling for Crowdsourced Live Streaming
Comyco: A Video Quality-aware Adaptive Bitrate Approach via Imitation Learning
Towards 6DoF HTTP Adaptive Streaming Through Point Cloud Compression
Lossy Intermediate Deep Learning Feature Compression and Evaluation
Low-Latency Channel-Adaptive Error Control for Interactive Streaming
Band and Quality Selection for Efficient Transmission of Hyperspectral Images
AdaCompress: Adaptive Compression for Online Computer Vision Services
Talking Video Heads – Saving Streaming Bitrate by Adaptively Applying Object-based Video Principles to Interview-like Footage
Navigation Graph for Tiled Media Streaming
CACA: Learning-based Content-Aware Cache Admission for Video Content in Edge Caching

Multimodal Fusion

Modality-aware Collaborative Learning for Visible Thermal Person Re-Identification
TC-Net for iSBIR: Triplet Classification Network for Instance-level Sketch Based Image Retrieval
Knowledge-guided Pairwise Reconstruction Network for Weakly Supervised Referring Expression Grounding
Small and Dense Commodity Object Detection with Multi-Receptive Field Attention
Matching Images and Text with Multi-modal Tensor Fusion and Re-ranking
A New Benchmark and Approach for Fine-grained Cross-media Retrieval
Aberrance-aware Gradient-sensitive Attentions for Scene Recognition with RGB-D Videos
Learning Fragment Self-Attention Embeddings for Image-Text Matching
Dual-alignment Feature Embedding for Cross-modality Person Re-identification
Watch, Reason and Code: Learning to Represent Videos Using Program
Adaptive Semantic-Visual Tree for Hierarchical Embeddings
Deep Spatial Pyramid Features Collaborative Reconstruction for Partial Person Re-identification
SRN: Structured Stochastic Recurrent Network for Linguistic Video Prediction
Learning Using Privileged Information for Food Recognition
GP-BPR: Personalized Compatibility Modeling for Clothing Matching
Deep Adversarial Graph Attention Convolution Network for Text-Based Person Search
Audiovisual Transformer Architectures for Large-Scale Classification and Synchronization of Weakly Labeled Audio Events
Cost-free Transfer Learning Mechanism: Deep Digging Relationships of Action Categories
Fine-grained Cross-media Representation Learning with Deep Quantization Attention Network
Multimodal Dialog System: Generating Responses via Adaptive Decoders
Multimodal Classification of Urban Micro-Events
Mutual Correlation Attentive Factors in Dyadic Fusion Networks for Speech Emotion Recognition
Cross-Modal Subspace Learning with Scheduled Adaptive Margin Constraints
Cycle-consistent Conditional Adversarial Transfer Networks
LinesToFacePhoto: Face Photo Generation From Lines With Conditional Self-Attention Generative Adversarial Networks

Vision and Language

Curiosity-driven Reinforcement Learning for Diverse Visual Paragraph Generation
Aligning Linguistic Words and Visual Semantic Units for Image Captioning
Sentence Specified Dynamic Video Thumbnail Generation
Heterogeneous Domain Adaptation via Soft Transfer Network
Aesthetic Attributes Assessment of Images
Watch It Twice: Video Captioning with a Refocused Video Encoder
Hierarchical Global-Local Temporal Modeling for Video Captioning
Video-Based Cross-Modal Recipe Retrieval
MUCH: MUtual Coupling enHancement of Scene Recognition and Dense Captioning
Multi-interaction Network with Object Relation for Video Question Answering
Cycle In Cycle Generative Adversarial Networks for Keypoint-Guided Image Generation
Multi-modal Knowledge-aware Hierarchical Attention Network for Explainable Medical Question Answering
Walking with MIND: Mental Imagery eNhanceD Embodied QA
Unpaired Cross-lingual Image Caption Generation with Self-Supervised Rewards
Visual Relationship Detection with Relative Location Mining
Cross-Modal Image-Text Retrieval with Semantic Consistency
IntersectGAN: Learning Domain Intersection for Generating Images with Multiple Attributes
Learnable Aggregating Net with Divergent Loss for Video Question Answering
Cross-Modal Dual Learning for Sentence-to-Video Generation
Preserving Semantic and Temporal Consistency for Unpaired Video-to-Video Translation
Attention-based Densely Connected LSTM for Video Captioning
Weakly Supervised Fine-grained Image Classification via Correlation-guided Discriminative Learning
Erasing-based Attention Learning for Visual Question Answering
Exploiting Temporal Relationships in Video Moment Localization with Natural Language
Attention Transfer (ANT) Network for View-invariant Action Recognition
Hierarchical Visual Relationship Detection
Question-Aware Tube-Switch Network for Video Question Answering
Annotation Efficient Cross-Modal Retrieval with Adversarial Attentive Alignment
Diachronic Cross-modal Embeddings
Critic-based Attention Network for Event-based Video Captioning
Alleviating Feature Confusion for Generative Zero-shot Learning
Semi-supervised Deep Quantization for Cross-modal Search
Referring Expression Comprehension with Semantic Visual Relationship and Word Mapping

Media Interpretation

You Only Recognize Once: Towards Fast Video Text Spotting
Ingredient-Guided Cascaded Multi-Attention Network for Food Recognition
SRINet: Learning Strictly Rotation-Invariant Representations for Point Cloud Classification and Segmentation
DoT-GNN: Domain-Transferred Graph Neural Network for Group Re-identification
Duet Robust Deep Subspace Clustering
Predicting Future Instance Segmentation with Contextual Pyramid ConvLSTMs
MetaAdvDet: Towards Robust Detection of Evolving Adversarial Attacks
Video Retargeting: Trade-off between Content Preservation and Spatio-temporal Consistency
PAN: Persistent Appearance Network with an Efficient Motion Cue for Fast Action Recognition
Intrinsic Image Popularity Assessment
Exploring Background-bias for Anomaly Detection in Surveillance Videos
L2G Auto-encoder: Understanding Point Clouds by Local-to-Global Reconstruction with Hierarchical Self-Attention
An Attentional-LSTM for Improved Classification of Brain Activities Evoked by Images
Exploit the Connectivity: Multi-Object Tracking with TrackletNet
Fast Non-Local Neural Networks with Spectral Residual Learning
Pedestrian Attribute Recognition by Deep Hierarchical Multi-task Learning with Correlation Attention Mechanism
Deep Fusion Network for Image Completion
Defending Against Adversarial Examples via Soft Decision Tree Embedding
On Learning Disentangled Representation for Acoustic Event Detection
Single-shot Semantic Image Inpainting with Densely Connected Generative Networks
M2E-Try On Net: Fashion from Model to Everyone
Adaptive Multi-Path Aggregation for Human DensePose Estimation In The Wild
Explainable Video Action Reasoning via Prior Knowledge and State Transitions
Unsupervised Domain Adaptation for 3D Human Pose Estimation
Self-supervised Representation Learning using 360° Data
Joint-attention Discriminator for Accurate Super-resolution via Adversarial Training
Tell Me Where is Still Blurry: Adversarial Blurred Region Mining and Refining
Optimized Skeleton-based Action Recognition via Sparsified Graph Regression
Self-supervised Face-Grouping on Graphs
Multi-Level Fusion based Class-aware Attention Model for Weakly Labeled Audio Tagging
Towards Optimal CNN Descriptors for Large-Scale Image Retrieval
Generative Reconstructive Hashing for Incomplete Video Analysis
Dense Feature Aggregation and Pruning for RGBT Tracking
360-degree Video Gaze Behaviour: A Ground-Truth Data Set and a Classification Algorithm for Eye Movements
Action Recognition with Bootstrapping based Long-range Temporal Context Attention
Progressive Retinex: Mutually Reinforced Illumination-Noise Perception Network for Low-Light Image Enhancement
Embodied One-Shot Video Recognition: Learning from Actions of a Virtual Embodied Agent
Attacking Gait Recognition Systems via Silhouette Guided GANs
DaNet: Decompose-and-aggregate Network for 3D Human Shape and Pose Estimation
See Through the Windshield from Surveillance Camera Images
A Unified Multiple Graph Learning and Convolutional Network Model for Co-saliency Estimation
Joint Adversarial Domain Adaptation
What I See Is What You See: Joint Attention Learning for First and Third Person Video Co-analysis
Improving the Learning of Multi-column Convolutional Neural Network for Crowd Counting
Imbalance-aware Pairwise Constraint Propagation
Data Priming Network for Automatic Check-Out
TGG: Transferable Graph Generation for Zero-shot and Few-shot Learning
Video Relation Detection with Spatio-Temporal Graph
GroundNet: Monocular Ground Plane Estimation with Geometric Consistency
Fewer-Shots and Lower-Resolutions: Towards Ultrafast Face Recognition in the Wild
Black-box Adversarial Attacks on Video Recognition Models
Outfit Compatibility Prediction and Diagnosis with Multi-Layered Comparison Network
TC-GAN: Triangle Cycle-Consistent GANs for Face Frontalization with Facial Features Preserved
Hybrid Image Enhancement With Progressive Laplacian Enhancing Unit
3D Singing Head for Music VR: Learning External and Internal Articulatory Synchronicity from Lyric, Audio and Notes
Joint Rotation-Invariance Face Detection and Alignment with Angle-Sensitivity Cascaded Networks
Asynchronous Tracking-by-Detection on Adaptive Time Surfaces for Event-based Object Tracking
Adversarial Seeded Sequence Growing for Weakly-Supervised Temporal Action Localization
Zero-Shot Restoration of Back-lit Images Using Deep Internal Learning
Mixed-dish Recognition with Contextual Relation Network
Perceptual Visual Reasoning with Knowledge Propagation
Kindling the Darkness: A Practical Low-light Image Enhancer
Instance of Interest Detection
Visual Relation Detection with Multi-Level Attention
Multi-modal Multi-layer Fusion Network with Average Binary Center Loss for Face Anti-spoofing
Robust Subspace Discovery by Block-diagonal Adaptive Locality-constrained Representation
DADNet: Dilated-Attention-Deformable ConvNet for Crowd Counting
Ranking Video Salient Object Detection
Cross-Fiber Spatial-Temporal Co-enhanced Networks for Video Action Recognition
Crowd Counting via Multi-layer Regression
A Single-Shot Arbitrarily-Shaped Text Detector based on Context Attended Multi-Task Learning
Gesture-to-Gesture Translation in the Wild via Category-Independent Conditional Maps
BraidNet: Braiding Semantics and Details for Accurate Human Parsing
Learning Semantics-aware Distance Map with Semantics Layering Network for Amodal Instance Segmentation
Mocycle-GAN: Unpaired Video-to-Video Translation
POINet: Pose-Guided Ovonic Insight Network for Multi-Person Pose Tracking
Long Short-Term Relation Networks for Video Action Detection
Adaptive Feature Fusion via Graph Neural Network for Person Re-identification
Illumination-Invariant Person Re-Identification
Co-saliency Detection Based on Hierarchical Consistency
Sparse Temporal Causal Convolution for Efficient Action Modeling
Prediction-CGAN: Human Action Prediction with Conditional Generative Adversarial Networks
FashionOn: Semantic-guided Image-based Virtual Try-on with Detailed Human and Clothing Information
Ground-Aware Point Cloud Semantic Segmentation for Autonomous Driving
Training Efficient Saliency Prediction Models with Knowledge Distillation
Video Text Detection by Attentive Spatiotemporal Fusion of Deep Convolutional Features