CVPR2013: Activity Recognition

There are a lot of papers at CVPR. The following summary is based on the 37 papers I found related to activity analysis/recognition at CVPR 2013. This represents 8% of the proceedings (total=472 papers). There were also two workshops on activity recognition (ACTS and HAU3D) and one on scene understanding (SUNw); this writeup does not discuss those. For a more general overview of the conference papers see the blog entries of Tomasz Malisiewicz and Andrej Karpathy. PDFs of all of the papers can be found here.

For NIPS2012 Andrej did an interesting analysis using LDA to cluster the proceedings into different categories. Inspired by this idea, and to make it easier to sort through relevant papers, I did the same for CVPR2013. Go here for clustering of all of the CVPR papers and here for clustering of only the action recognition papers. Latent Dirichlet Allocation is used to infer a set of topics for each paper. The topics for the first analysis (all of the papers) appear to segment out the different categories (e.g. dictionary learning, geometric models, scene analysis, etc.) much more clearly than the topics for the action-only data. I also tried using HDP-LDA to automatically infer the number of topics in the action dataset but got a worse looking set of topics. To highlight the clustering’s effectiveness: I used it after manually going through all the CVPR titles and ended up finding a few action papers that I had missed. The site also contains links to all of the PDFs and other useful information to help sort through the papers.
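To make the topic-inference step concrete, here is a minimal sketch of LDA fitted with collapsed Gibbs sampling on tokenized documents. It is not the code behind the clustering site; the toy documents, the hyperparameters (`alpha`, `beta`), and the function name `lda_gibbs` are all assumptions chosen for illustration.

```python
import random


def lda_gibbs(docs, n_topics, n_iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA; docs is a list of word lists."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)

    # Count tables: topics per document, words per topic, words per topic total.
    doc_topic = [[0] * n_topics for _ in docs]
    topic_word = [[0] * n_words for _ in range(n_topics)]
    topic_total = [0] * n_topics
    assignments = []
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(n_topics)  # random initial topic
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word_id[w]] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v, z = word_id[w], assignments[d][i]
                # Remove the current assignment, then resample it.
                doc_topic[d][z] -= 1
                topic_word[z][v] -= 1
                topic_total[z] -= 1
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][v] + beta)
                    / (topic_total[k] + n_words * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][v] += 1
                topic_total[z] += 1

    # Per-document topic mixture and the three most frequent words per topic.
    theta = [
        [(c + alpha) / (len(doc) + n_topics * alpha) for c in counts]
        for counts, doc in zip(doc_topic, docs)
    ]
    top_words = [
        [vocab[v] for v in sorted(range(n_words), key=lambda v: -row[v])[:3]]
        for row in topic_word
    ]
    return theta, top_words


# Hypothetical stand-ins for tokenized paper abstracts.
docs = [
    "pose skeleton joints pose kinect".split(),
    "skeleton kinect depth joints".split(),
    "event retrieval video web event".split(),
    "video event web retrieval".split(),
]
theta, top_words = lda_gibbs(docs, n_topics=2)
for idx, mix in enumerate(theta):
    print(idx, [round(p, 2) for p in mix])
print(top_words)
```

Each paper can then be filed under its highest-weight topic. The number of topics has to be chosen by hand here, which is the knob that HDP-LDA tries to infer automatically.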

Overall, the action papers can be divided into the following categories: new features [10x], poselets and parts-based models [6x], action sequences [20x], and one application. At the end is a listing of each paper in its category with a brief set of descriptions. After that is a reverse indexing of the datasets used in each paper. The papers in bold are ones that intrigued me based on reading the abstracts and glancing through each paper. If you think any of these are incorrectly categorized let me know.

Features:

2D features

I was a little disappointed with most of the 2D feature papers. On the surface most of them appear to be tweaks of techniques that have been around for a while. Given the recent popularity of deep learning I’m surprised that there aren’t more methods for unsupervised discovery of features for actions. There were a couple of papers that intrigued me. The first, Dynamic Scene Classification: Learning Motion Descriptors with Slow Features Analysis, uses Slow Feature Analysis (SFA) as an unsupervised method of learning motion descriptors for scenes, with the goal of video-based scene classification. The idea of SFA is to learn a model for a slowly varying signal from a fast, noisy one. [edit: I now see this is actually based on using SFA for actions in PAMI2012].
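As a rough illustration of the SFA idea (not the paper's pipeline), the sketch below takes two observed signals that each mix a slow and a fast source, and searches for the linear combination whose frame-to-frame change is smallest relative to its variance. The brute-force angle search only works for two inputs and stands in for the generalized eigenproblem normally solved in SFA; the signals and function names are made up for the example.

```python
import math


def center(x):
    m = sum(x) / len(x)
    return [v - m for v in x]


def moment(u, v):
    return sum(a * b for a, b in zip(u, v)) / len(u)


def slowest_direction(x1, x2, n_angles=1800):
    """Return weights (c, s) so that c*x1 + s*x2 varies as slowly as possible."""
    x1, x2 = center(x1), center(x2)
    d1 = [b - a for a, b in zip(x1, x1[1:])]  # finite-difference derivative
    d2 = [b - a for a, b in zip(x2, x2[1:])]
    b11, b12, b22 = moment(x1, x1), moment(x1, x2), moment(x2, x2)
    a11, a12, a22 = moment(d1, d1), moment(d1, d2), moment(d2, d2)
    best = None
    for i in range(n_angles):
        theta = math.pi * i / n_angles
        c, s = math.cos(theta), math.sin(theta)
        change = a11 * c * c + 2 * a12 * c * s + a22 * s * s
        variance = b11 * c * c + 2 * b12 * c * s + b22 * s * s
        # Slowness = mean squared change per unit variance of the output.
        if variance > 1e-12 and (best is None or change / variance < best[0]):
            best = (change / variance, c, s)
    return best[1], best[2]


def correlation(u, v):
    u, v = center(u), center(v)
    return moment(u, v) / math.sqrt(moment(u, u) * moment(v, v))


# Toy data: a slow and a fast sinusoid, observed only through two mixtures.
n = 2000
t = [2 * math.pi * i / n for i in range(n)]
slow = [math.sin(v) for v in t]
fast = [math.sin(11 * v) for v in t]
x1 = [a + b for a, b in zip(slow, fast)]
x2 = [2 * a - b for a, b in zip(slow, fast)]

c, s = slowest_direction(x1, x2)
recovered = [c * a + s * b for a, b in zip(x1, x2)]
print(abs(correlation(recovered, slow)))
```

The recovered signal should track the slow source closely (up to sign and scale), which is the property that makes SFA outputs attractive as motion descriptors.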

The second is Representing Videos Using Mid-level Discriminative Patches [1]. I wasn’t sure whether to put it into the features or short actions category. The paper describes a new system for action recognition but the focus is on representation. They develop a higher-level feature that can be used as an action primitive. Unlike most spatio-temporal features that use sub-second timescales, Jain et al. look at scales of 50 frames or longer. Their approach is in essence an extension of exemplar SVMs for actions. The motivation is that the clustering step in traditional bag of words models introduces a lot of error and ambiguity, and using e-SVMs skirts around this. They use cuboids as their base feature and use per-frame object locations and human pose as annotations for learning (note: these are not inferred at test time). One thing I like about this paper is that they discuss action “explanation” for object localization and fine-grained action details by doing label transfer from the training data.


[1] Representing Videos Using Mid-level Discriminative Patches
Arpit Jain, Abhinav Gupta, Mikel Rodriguez, Larry S. Davis

3D Features

Most publicly available 3D action datasets consist of simple actions or gestures in constrained environments. Inspired by Hollywood2, the paper Hollywood 3D: Recognizing Actions in 3D Natural Scenes [2] introduces a new dataset that takes scenes from 3D films like Avatar, Pirates of the Caribbean, and 12 other movies and decomposes clips into color and depth images. While each of the clips is short in duration, these are much more varied than current datasets. This paper also extends many of the current common spatio-temporal feature techniques into the 4D domain (RGBD+time).

[2] Hollywood 3D: Recognizing Actions in 3D Natural Scenes
Simon Hadfield, Richard Bowden

It is interesting to see that despite taking very different approaches, the papers on HON4D (Histogram of Oriented 4D Normals) and Spatio-temporal Depth Cuboids both achieve approximately 89% accuracy on the MSR Action3D dataset.

Poselets and part-based models:

Poselets use part-based detectors to find the major body parts of a human. This is an idea that was first introduced a few years ago by Jitendra Malik’s group and aims to create a higher level feature representation than the commonly used STIPs or other spatio-temporal methods. One of the reasons I think more people should be working with skeletal data (e.g. via a Kinect) is that I believe soon we will be able to rely heavily on having human skeletons in most action data. These works are helping to push that boundary. This year there are two papers that look at actions from still images using poselets and four using video.

The paper Spatiotemporal Deformable Part Models for Action Detection (SDPM) extends per-frame deformable part models into the temporal domain and discriminatively finds the most important 2.5D subvolume for each action. Poselet Key-Framing: A Model for Human Activity Recognition does max pooling on poselets to get a per-frame feature descriptor and then effectively does temporal segmentation to find salient key-frames from each video using a Structural SVM. I like that An Approach to Pose-Based Action Recognition [3] focuses on the skeleton structure and develops a representation usable for both poselet models and general skeleton data like that from a Kinect. They even test on both 2D and 3D data.
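The max-pooling step mentioned for Poselet Key-Framing can be sketched as follows. This is only an illustration of max pooling over detector responses, assuming each frame yields a list of `(poselet_id, score)` detections with non-negative scores; the data layout and function name are hypothetical and not taken from the paper.

```python
def max_pool_frame(detections, n_poselets):
    """Build a per-frame descriptor: the best score seen for each poselet type."""
    descriptor = [0.0] * n_poselets  # 0.0 means the poselet did not fire
    for poselet_id, score in detections:
        descriptor[poselet_id] = max(descriptor[poselet_id], score)
    return descriptor


# One frame with several overlapping detections of poselets 0 and 2.
frame_detections = [(0, 0.2), (2, 0.9), (0, 0.7), (2, 0.1)]
print(max_pool_frame(frame_detections, n_poselets=4))
```

The resulting descriptor has a fixed length regardless of how many detections fire in a frame, which is what lets a sequence model compare frames directly.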

It appears that because of the complexity of simply getting good poselets, the action models tend to be very simple; many of them are some variation of Bag of Words. However, they still tend to do very well in terms of accuracy. For example, SDPM gets 100% accuracy on the Weizmann dataset [note: this might be common?] and does much better than non poselet-based methods on two other datasets as well. Despite their promise, there is still a lot of work left before this kind of thing can be used in practice. One of the big problems is that they are typically very slow to run.
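For readers unfamiliar with the Bag of Words models referred to here, a minimal sketch: every local feature is assigned to its nearest codeword, and the video is summarized by the normalized histogram of codeword counts. The codebook and features below are toy values; in practice the codebook would come from a clustering step such as k-means, which is assumed and not shown.

```python
def nearest_codeword(feature, codebook):
    """Index of the codeword with the smallest squared Euclidean distance."""
    return min(
        range(len(codebook)),
        key=lambda k: sum((f - c) ** 2 for f, c in zip(feature, codebook[k])),
    )


def bow_histogram(features, codebook):
    """Normalized histogram of codeword assignments for one video."""
    hist = [0] * len(codebook)
    for feature in features:
        hist[nearest_codeword(feature, codebook)] += 1
    total = sum(hist)
    return [h / total for h in hist] if total else hist


codebook = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]  # toy 2D codewords
features = [(0.1, 0.0), (0.9, 1.2), (1.1, 0.8), (4.0, 6.0)]
print(bow_histogram(features, codebook))
```

Note that the histogram discards where and when each feature occurred, which is why these models give global context but not temporally localized detail.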

 
[3] An Approach to Pose-Based Action Recognition 
Chunyu Wang, Yizhou Wang, and Alan L. Yuille 

There are also two interesting papers that look at people and their social roles in images (not included in this listing). They use 3D location, poselets and other properties of body shape to infer social roles:

3D Visual Proxemics: Recognizing Human Interactions in 3D from a Single Image
Ishani Chakraborty, Hui Cheng, Omar Javed 
 
Social Role Discovery in Human Events
V. Ramanathan, B. Yao, and L. Fei-Fei.
 
Actions/Sequences:

Over half of  the action-related papers develop new models for actions or sequences of actions. There are many types of models here but for generality I am lumping them all together. These include simple/short actions, long sequences, egocentric videos, and group/social actions. Most of these are discriminative (using SVMs or max-margin techniques) but a handful introduce interesting generative approaches.

While short actions (think KTH or any other dataset where videos are <10 seconds and contain one action) can be interesting to look at from an action primitive standpoint, for most practical applications it is important to factor problems like temporal segmentation into your model. While I will focus on long action sequences, there are a handful of interesting short action papers in the proceedings including Bilinear Programming for Human Activity Recognition with Unknown MRF Graphs. This paper creates a graph over different objects in an image (e.g. tennis players in a match) and uses that as context for the action model.

One that I am eager to read is A Thousand Frames in Just a Few Words: Lingual Description of Videos through Latent Topics and Sparse Object Stitching [4] by Jason Corso’s group at Buffalo. Their technique uses topic models and NLP techniques to create a written summary of the set of actions that occur in a video. It consists of a low level multi-modal topic model over visual descriptors (with HOG3D), mid level concept detectors using deformable parts models to find actions and context, and high level semantic verification to ensure the whole thing makes sense. It’s a pretty complex model but is formed using many theoretically principled ideas. I spent part of my undergrad doing research in Jason’s lab so I am interested to read more.

[4] A Thousand Frames in Just a Few Words: Lingual Description of Videos through Latent Topics and Sparse Object Stitching
Pradipto Das*, Chenliang Xu*, Richard F. Doell and Jason J. Corso 
 

There is a trend for trying to develop techniques that fuse global+local techniques. The idea is that global methods like bag of words can give general context but don’t provide fine grained, temporally localized actions. Local methods are better for describing sub-actions and details but often get confused because of their lack of context. By developing more sophisticated models we are often trying to obtain specific pieces of context without resorting to pulling in the kitchen sink via BoW. One paper I liked in this direction was Context-Aware Modeling and Recognition of Activities in Video [5]. The premise in this work is that things that are related both spatially and temporally are usually dependent on each other. They learn the duration, motions, and context for each action using structured prediction. In the image below the person and the car are related because they are close in proximity and there is a temporal relationship.

[5] Context-Aware Modeling and Recognition of Activities in Video
Yingying Zhu, Nandita M. Nayak, Amit K. Roy-Chowdhury

There are a handful of papers that look at actions using egocentric videos. I wouldn’t be surprised if we see more of these papers in the near future given technologies like Google Glass and GoPro. The idea is to summarize long periods of video from a hand-held (or wearable) camera. One paper that stuck out was Story-Driven Summarization for Egocentric Video [6], which tries to find a set of subshots that best depicts the essential parts of a video. They use an MRF-based model with three potentials: a “story” potential that seeks coherency between shots (objects should consistently be in a shot and should not move in and out of the shot), an “importance” potential that looks at how prominent objects are in a scene, and a “diversity” potential to make sure not all of the shots are of the same scene. I’m not especially familiar with the egocentric literature, so it’s possible people have already been doing this, but I think their idea of picking out objects in the scene as a frame of reference for how things change is especially interesting.
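To give a feel for how three such potentials trade off against each other, here is a toy scoring function over chains of subshots. It is emphatically not the authors' formulation: the object sets, importance values, scene labels, weights, and the brute-force search are all assumptions made up for illustration.

```python
from itertools import combinations


def score_chain(chain, objects, importance, scene, weights=(1.0, 1.0, 1.0)):
    """Score an ordered chain of subshot indices with three toy potentials."""
    w_story, w_importance, w_diversity = weights
    pairs = list(zip(chain, chain[1:]))
    # "Story": adjacent selected subshots should share objects.
    story = sum(len(objects[a] & objects[b]) for a, b in pairs)
    # "Importance": prefer subshots with prominent objects.
    important = sum(importance[i] for i in chain)
    # "Diversity": reward adjacent selected subshots from different scenes.
    diversity = sum(1 for a, b in pairs if scene[a] != scene[b])
    return w_story * story + w_importance * important + w_diversity * diversity


def best_chain(k, objects, importance, scene):
    """Exhaustively pick the best k subshots, kept in temporal order."""
    candidates = combinations(range(len(objects)), k)
    return max(candidates, key=lambda c: score_chain(c, objects, importance, scene))


# Five hypothetical subshots from one video.
objects = [{"cup"}, {"cup", "kettle"}, {"kettle"}, {"dog"}, {"kettle", "tea"}]
importance = [0.2, 0.9, 0.5, 0.1, 0.8]
scene = ["kitchen", "kitchen", "kitchen", "yard", "table"]
print(best_chain(3, objects, importance, scene))
```

Exhaustive search is only feasible for a handful of subshots; for hours of video one would need proper inference over the chain, which this sketch does not attempt.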


[6] Story-Driven Summarization for Egocentric Video
Zheng Lu and Kristen Grauman 

There are two other papers that stuck out in my head. The first is Action Recognition by Hierarchical Sequence Summarization [7], which recursively alternates between sequence learning and sequence summarization using a latent CRF model to recognize actions in long videos. Their approach of stacking up latent variables and using nonlinear gate functions reminds me of some recent work in Deep Learning. The second, Online Dominant and Anomalous Behavior Detection in Videos, develops an unsupervised approach to decompose actions into separate categories for dominant and anomalous behaviors. For example, if you are looking at a busy pathway with lots of people then the dominant action is people walking and an anomalous action would be a car driving through.
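The dominant-versus-anomalous split can be illustrated with a very simple online frequency model: an observation is flagged as anomalous when its running share of everything seen so far is low. This is a toy stand-in, not the paper's method; the behavior labels, the `min_share` threshold, and the warm-up length are assumptions.

```python
from collections import Counter


class OnlineBehaviorModel:
    """Flags observations whose running frequency is below a threshold."""

    def __init__(self, min_share=0.05, warmup=20):
        self.counts = Counter()
        self.total = 0
        self.min_share = min_share
        self.warmup = warmup  # do not flag anything until enough data is seen

    def observe(self, label):
        """Update the counts and return True if the label looks anomalous."""
        self.counts[label] += 1
        self.total += 1
        if self.total < self.warmup:
            return False
        return self.counts[label] / self.total < self.min_share


# A busy pathway: many walking observations, then one car.
model = OnlineBehaviorModel()
stream = ["walk"] * 50 + ["car"]
flags = [model.observe(label) for label in stream]
print(flags[-1], any(flags[:-1]))
```

A real system would have to discover the behavior "labels" from video features without supervision; here they are given, which sidesteps the hard part.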


[7] Action Recognition by Hierarchical Sequence Summarization
Yale Song, Louis-Philippe Morency, Randall Davis

Datasets

I often find it useful to look at the data papers are using before reading their method. This can give a better indication of the actual problem someone is trying to solve. For example, if you see a paper using KTH then they likely aren’t really dealing with complex actions. See the list at the bottom for an indexing of dataset to paper. Note that this list was generated automatically (with some manual tweaking) by matching dataset names to words in the papers. This means that there may be some mistakes.
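The automatic matching described above can be sketched roughly as follows. This is a reconstruction under assumptions, not the script actually used: the paper texts, the dataset name list, and the whole-word matching rule are illustrative.

```python
import re
from collections import defaultdict


def index_datasets(papers, dataset_names):
    """Map each dataset name to the titles of papers whose text mentions it."""
    index = defaultdict(list)
    for title, text in papers.items():
        lowered = text.lower()
        for name in dataset_names:
            # Match the name as a whole token so "ucf50" does not match "ucf501".
            pattern = r"(?<!\w)" + re.escape(name.lower()) + r"(?!\w)"
            if re.search(pattern, lowered):
                index[name].append(title)
    return dict(index)


# Hypothetical paper texts keyed by title.
papers = {
    "Paper A": "We evaluate on KTH and UCF Sports.",
    "Paper B": "Results on KTH; also UCF50.",
    "Paper C": "We use the IXMAS multi-view set.",
}
names = ["kth", "ucf sports", "ucf50", "ixmas"]

index = index_datasets(papers, names)
for name in sorted(index, key=lambda n: -len(index[n])):
    print("Dataset: %s [%d]" % (name, len(index[name])))
single_use = [n for n in index if len(index[n]) == 1]
print(len(single_use), "datasets used by a single paper")
```

Plain string matching also picks up datasets that a paper only cites in related work, which is one source of the mistakes mentioned above and why some manual tweaking is still needed.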

It’s interesting to see the long tail of datasets. There are 30 datasets that are only used by a single paper. In terms of popularity, KTH [10x] is still the most used for techniques working with simple actions followed by UCF Sports [5x]. Trecvid [4x] appears to be the most common for event retrieval and UT-Interaction [3x] is most common for complex, group actions. MSR Action3D [3x] is the most popular 3D action dataset. On average each paper uses about 2 datasets (79 dataset instances, 38 papers).

What is the reason for this long tail? It makes sense to develop a new dataset to solve a completely new problem but there is large overlap between some of these datasets. For example, it seems that there are at least 4 short action Kinect datasets where a single user is standing in front of the device doing some sort of action. This makes it harder to compare methods over time. If your application is really novel, then sure create (and share!) new data, but if there is similar data out there don’t bother doing something new.

New datasets introduced: 
– Hollywood 3D [650 video clips, 14 classes] with stereo color + depth data
– Manipulation Action Consequences dataset
– EVVE Event database
– YouCook dataset
– MSR Action Pairs dataset
– UT-Kinect-Action dataset

Thoughts and questions

In agreement with both Tomasz and Andrej, it is increasingly important to work at a higher level if we want to obtain general scene understanding. If we want to do this then we need better high-level representations. For example, with humans we should be working with skeletons, not sets of pixels. There are so many features for pixels, so why so few for other data structures? (Running SIFT on skeletal data is nonsensical.) It may be important for context to look at pixels neighboring each skeletal joint, but overall we should look more at joint interactions and patterns at the skeletal level. There are a number of papers using depth data, including a few introducing new 3D features, yet only one that I noticed developed new features using the skeleton. I do see some of this happening in the poselet papers but not as much in papers using Kinect data.

A good chunk of the papers provide code and data. While I am grateful when authors make this available, often it seems to go unused or unsupported. Often it takes longer to get someone’s code to work than it does to reimplement the model yourself. Perhaps there should be a bigger push for people to include their code in one of the major vision libraries. I guess a key problem stems from the various languages people use to code. There are three main camps: hard core C++ people, Python people, and Matlab people. I guess ideally stuff should all be ported to C++ with bindings to Python and Matlab. For the sake of reproducibility it would be interesting if you were forced to include code with all CVPR submissions.

Additionally, it seems like there are a lot of separate threads going on that don’t quite intersect. For example, most of the poselet papers use very simple models for activity recognition. When combined with some of the more complex models in the ‘long actions’ category of papers it seems like they could become very powerful. If more code was available this would be easier to do. In particular I’m thinking of techniques like structured prediction which typically require a lot of knowledge to get started with (maybe pyStruct will help?).

The idea of sharing implementations also means that we should consider sharing output data too. For example, I’m aware that in the Chalearn Gesture Recognition Competition last year they had downloadable data for raw color/depth images and for the STIPs output from the color data. Perhaps this mid-level data could be easier to host instead of actual code?

Overall it seems that people are making steady progress in this area. I noticed more papers on structured prediction, high-level representations, and techniques for using relevant context than in the past. Another thing I noticed is the lack of deep learning and fancy generative models like Bayesian nonparametrics for activity recognition. Most of the papers I see in these directions go to more machine learning-themed conferences like ICML and NIPS. Given recent successes in other areas I would love to see more of them.

 

 

————————————————————————–

Indexing by Category

Features (10x)

2D Features

Evaluation of Color STIPs for Human Action Recognition
– features, STIPs… but using RGB data
 
Better Exploiting Motion for Better Action Recognition
– features, short actions, differential motion, decompose into dominant + residual motion, VLAD coding
Dataset: Hollywood2, HMDB51, Olympic Sports
 
Sampling Strategies for Real-Time Action Recognition
– features, dense random sampling from local spatio-temporal features, multi-channel
 
3D R Transform on Spatio-temporal Interest Points for Action Recognition
– features, global geometrical distribution of interest points, 3D discrete Radon transform, SVM with new kernel: pairwise similarity and higher-order contextual interactions
 
Motionlets: Mid-Level 3D Parts for Human Motion Recognition
– feature, cluster in motion and appearance corresponding to movement of body parts, process: extract 3D regions with high saliency, cluster candidate templates, use greedy method to select effective candidates
Dataset: KTH, HMDB51, UCF50
 
Dynamic Scene Classification: Learning Motion Descriptors with Slow Features Analysis
– feature, unsupervised learning of motion features, introduce learned local motion descriptor using Slow Feature Analysis, generate signature for each video using global pooling architecture.
Dataset: Large Scale Visual Recognition Challenge
 
Representing Videos Using Mid-level Discriminative Patches
– spatio-temporal patches that correspond to action primitives, exemplar-based clustering (Exemplar-SVM/e-SVM), sample patches from image (and prune), rank by appearance and purity, use label transfer for object localization and finer level action detection
Dataset: UCF50, Olympics

3D Features

Hollywood 3D: Recognizing Actions in 3D Natural Scenes
– features, new dataset, 3D, BOW, extends 2D features to 3D
 
HON4D: Histogram of Oriented 4D Normals for Activity Recognition from Depth Sequences
– features, new dataset, 3D, action pairs, histogram of temporal normals, projections
 
Spatio-temporal Depth Cuboid Similarity Feature for Activity Recognition Using Depth Camera
– features, depth cuboid similarity feature (DCSF), a depth-version of STIPs/cuboids

Poselets/Parts-based models (6x)

Videos
Spatiotemporal Deformable Part Models for Action Detection
– short actions, spatiotemporal action is a deformable part model, spatio-temporal sub-volumes are parts, latent svm
 
An Approach to Pose-Based Action Recognition
– short actions, improves body pose estimation, 3D, groups pose into 5 core parts, BoW for actions
 
Poselet Key-Framing: A Model for Human Activity Recognition
– actions, sparse sequence of poselets, max-margin, structural svm, max pooling on detectors
 
Unconstrained Monocular 3D Human Pose Estimation by Action Detection and Cross-Modality Regression Forest
– joint human pose estimation and activity recognition, parts-based model, uses Kinect skeleton as ground truth, random forest on appearance features for output of action and pose distribution
 
Still images
Expanded Parts Model for Human Attribute and Action Recognition in Still Images
– images, attributes and actions, recompute score with reconstructed regions, optimization
 
Watching Unlabeled Video Helps Learn New Human Actions from Very Few Labeled Snapshots
– images, poselets, semi-supervised, action categories, synthesize poses
 
Actions/Sequences (20x)

Short actions:

Bilinear Programming for Human Activity Recognition with Unknown MRF Graphs
– short actions, group actions, learn edges in MRF, MAP inference, latent sSVM, tennis: players are nodes, tv: people are nodes
 
Multi-task Sparse Learning with Beta Process Prior for Action Recognition
– short actions, MTSL constructs test sample from multiple features using sparse dictionary with as few bases as possible, beta process prior for sparsity, gibbs sampling for full posterior
Dataset: KTH, UCF Sports
 
Cross-View Action Recognition via a Continuous Virtual Path
– short actions, multi-view, continuous virtual path connects source and target views, points on path are obtained by linear transformation on action descriptor, uses cuboids
Dataset: IXMAS
 
Robust Canonical Time Warping for the Alignment of Grossly Corrupted Sequences
– general, robust variation of dynamic time warping for actions, rank minimization and compressed sensing
Dataset: KTH
 
Action Recognition by Hierarchical Sequence Summarization
– short actions, alternates sequence learning and summarization, latent CRF, gate functions

Long actions:

A Thousand Frames in Just a Few Words: Lingual Description of Videos through Latent Topics and Sparse Object Stitching
– long actions, language models, topic model, semantic verification
 
Online Dominant and Anomalous Behavior Detection in Videos
– long actions, unsupervised, detects anomalous activities and objects, video parsed into background, dominant activities, and rare activities, hierarchical codebook for dominant behaviors, contextual graph
Dataset: UCSD pedestrian, subway surveillance
 
Context-Aware Modeling and Recognition of Activities in Video
– surveillance, low level motion segmentation (nonlinear dynamical system) and high level action framework, greedy inference, max margin, model learns duration, motion, and context, potentials: activity-duration, interactivity motion, intra-activity context, and inter-activity context
Dataset: VIRAT Ground
 
Augmenting Bag-of-Words: Data-Driven Discovery of Temporal and Structural Information for Activity Recognition
– long actions, data-driven, random regular expressions, surgical skill evaluation
 
Modeling Actions through State Changes
– long actions, recognition, segmentation, state detectors, segments changed region to classify
 
Recognize Human Activities from Partially Observed Videos
– long actions, divide each segment into ordered temporal segments, extract features from segments, apply sparse coding, find likelihood of each segment and global likelihood. Do prediction and gap filling.
Dataset: UT-Interaction, DARPA Y2
 
Groups (2x)
Finding Group Interactions in Social Clutter
– group actions, social, individual and pairwise descriptors, compare instance to exemplar, temporal and spatial localization
 
Multi-Agent Event Detection: Localization and Role Assignment
– group actions, detect people and classify events jointly with role assignment
Dataset: VIRAT, other ad-hoc data
 
Egocentric (3x)
Story-Driven Summarization for Egocentric Video
– long actions, create story-driven summaries for long unedited videos, create chain of subshots that best depict the video, potentials: story, importance, diversity, use linear MRF where a node is an 11 frame interval, each subshot is 15 seconds on average, videos are 4 hours
Dataset: UT Egocentric, Activities of Daily Living
 
First-Person Activity Recognition: What Are They Doing to Me?
– ego-centric, interaction-level human activities, multi-channel kernel SVM, hierarchical structure learning
 
Detection of Manipulation Action Consequences (MAC)
– manipulation actions, primitive action consequences for high-level classification of manipulation actions, active tracking and segmentation, visual semantic graph
 
Events (3x)
Event Recognition in Videos by Learning from Heterogeneous Web Sources
– events, do video classification of consumer video using only a separate set of weakly labeled images/videos, multi-kernel learning
Dataset: Kodak, CCV
 
Event retrieval in large video collections with circulant temporal encoding
– events, given a clip find another similar clip in large database, per frame detector with temporal ordering, circulant matrices to compare in frequency domain, introduces new dataset EVVE
Dataset: Trecvid, CCWeb, EVVE
 
Complex Event Detection via Multi-Source Video Attributes
– events, extend the idea of ‘relative attributes’ to videos, attributes: semantic label of simple external labeled video, Multi-level collaborative regression
Dataset: Trecvid med 2012
 

Applications (1x):

Decoding Children’s Social Behavior
– dataset, multi-modal activity recognition, analysis of social and communication behavior with 3D video and audio, introduce Multimodal Dyadic Dataset of adult-child interactions
 

Indexing by Datasets

Dataset: kth [10 papers]
Multi-agent Event Detection: Localization and Role Assignment
Robust Canonical Time Warping for the Alignment of Grossly Corrupted Sequences
Online Dominant and Anomalous Behavior Detection in Videos
Sampling Strategies for Real-Time Action Recognition
Spatiotemporal Deformable Part Models for Action Detection
Motionlets: Mid-level 3D Parts for Human Motion Recognition
Unconstrained Monocular 3D Human Pose Estimation by Action Detection and Cross-Modality Regression Forest
3D R Transform on Spatio-temporal Interest Points for Action Recognition
Multi-task Sparse Learning with Beta Process Prior for Action Recognition
 
Dataset: ucf sports [5 papers]
Watching Unlabeled Video Helps Learn New Human Actions from Very Few Labeled Snapshots
Evaluation of Color STIPs for Human Action Recognition
Spatiotemporal Deformable Part Models for Action Detection
3D R Transform on Spatio-temporal Interest Points for Action Recognition
Multi-task Sparse Learning with Beta Process Prior for Action Recognition
 
Dataset: ucf50 [4 papers]
Evaluation of Color STIPs for Human Action Recognition
Representing Videos Using Mid-level Discriminative Patches
Sampling Strategies for Real-Time Action Recognition
Motionlets: Mid-level 3D Parts for Human Motion Recognition
 
Dataset: trecvid [4 papers]
A Thousand Frames in Just a Few Words: Lingual Description of Videos through Latent Topics and Sparse Object Stitching
Story-Driven Summarization for Egocentric Video
Event Retrieval in Large Video Collections with Circulant Temporal Encoding
 
Dataset: hmdb51 [4 papers]
Watching Unlabeled Video Helps Learn New Human Actions from Very Few Labeled Snapshots
Better Exploiting Motion for Better Action Recognition
Sampling Strategies for Real-Time Action Recognition
Motionlets: Mid-level 3D Parts for Human Motion Recognition
 
Dataset: ut-interaction [3 papers]
Recognize Human Activities from Partially Observed Videos
Finding Group Interactions in Social Clutter
Poselet Key-Framing: A Model for Human Activity Recognition
 
Dataset: weizmann [3 papers]
Spatiotemporal Deformable Part Models for Action Detection
Unconstrained Monocular 3D Human Pose Estimation by Action Detection and Cross-Modality Regression Forest
 
Dataset: msr action3d [3 papers]
An Approach to Pose-Based Action Recognition
Spatio-temporal Depth Cuboid Similarity Feature for Activity Recognition Using Depth Camera
HON4D: Histogram of Oriented 4D Normals for Activity Recognition from Depth Sequences
 
Dataset: hollywood2 [2 papers]
Hollywood 3D: Recognizing Actions in 3D Natural Scenes
Better Exploiting Motion for Better Action Recognition
 
Dataset: activities of daily living [2 papers]
Story-Driven Summarization for Egocentric Video
First-Person Activity Recognition: What Are They Doing to Me?
 
Dataset: stanford 40 actions [2 papers]
Watching Unlabeled Video Helps Learn New Human Actions from Very Few Labeled Snapshots
Expanded Parts Model for Human Attribute and Action Recognition in Still Images
 
Dataset: olympics [2 papers]
Better Exploiting Motion for Better Action Recognition
Representing Videos Using Mid-level Discriminative Patches
 
Dataset: virat [2 papers]
Multi-agent Event Detection: Localization and Role Assignment
Context-Aware Modeling and Recognition of Activities in Video
 
Dataset: msr dailyactivity3d [2 papers]
Spatio-temporal Depth Cuboid Similarity Feature for Activity Recognition Using Depth Camera
HON4D: Histogram of Oriented 4D Normals for Activity Recognition from Depth Sequences
 
Dataset: ucf11 [1 paper]
Evaluation of Color STIPs for Human Action Recognition
 
Dataset: ucf13 [1 paper]
Representing Videos Using Mid-level Discriminative Patches
 
Dataset: ixmas [1 paper]
Cross-View Action Recognition via a Continuous Virtual Path
 
Dataset: hollywood 3d [1 paper]
Hollywood 3D: Recognizing Actions in 3D Natural Scenes
 
Dataset: ucsd pedestrian [1 paper]
Online Dominant and Anomalous Behavior Detection in Videos
 
Dataset: evve [1 paper]
Event Retrieval in Large Video Collections with Circulant Temporal Encoding
 
Dataset: canal9 [1 paper]
Action Recognition by Hierarchical Sequence Summarization
 
Dataset: natops [1 paper]
Action Recognition by Hierarchical Sequence Summarization
 
Dataset: ocean city [1 paper]
Augmenting Bag-of-Words: Data-Driven Discovery of Temporal and Structural Information for Activity Recognition
 
Dataset: darpa y1 [1 paper]
Recognize Human Activities from Partially Observed Videos
 
Dataset: darpa y2 [1 paper]
Recognize Human Activities from Partially Observed Videos
 
Dataset: visual proxemics [1 paper]
3D Visual Proxemics: Recognizing Human Interactions in 3D from a Single Image
 
Dataset: kodak [1 paper]
Event Recognition in Videos by Learning from Heterogeneous Web Sources
 
Dataset: ccv dataset [1 paper]
Event Recognition in Videos by Learning from Heterogeneous Web Sources
 
Dataset: youtube dataset [1 paper]
Event Recognition in Videos by Learning from Heterogeneous Web Sources
 
Dataset: mer12 [1 paper]
A Thousand Frames in Just a Few Words: Lingual Description of Videos through Latent Topics and Sparse Object Stitching
 
Dataset: youcook [1 paper]
A Thousand Frames in Just a Few Words: Lingual Description of Videos through Latent Topics and Sparse Object Stitching
 
Dataset: ut egocentric [1 paper]
Story-Driven Summarization for Egocentric Video
 
Dataset: willow 7 human actions [1 paper]
Expanded Parts Model for Human Attribute and Action Recognition in Still Images
 
Dataset: database of human attributes [1 paper]
Expanded Parts Model for Human Attribute and Action Recognition in Still Images
 
Dataset: armgesture [1 paper]
Action Recognition by Hierarchical Sequence Summarization
 
Dataset: keck [1 paper]
An Approach to Pose-Based Action Recognition
 
Dataset: msr gesture3d [1 paper]
HON4D: Histogram of Oriented 4D Normals for Activity Recognition from Depth Sequences
 
Dataset: msr action pairs [1 paper]
HON4D: Histogram of Oriented 4D Normals for Activity Recognition from Depth Sequences
 
Dataset: manipulation action consequences [1 paper]
Detection of Manipulation Action Consequences (MAC)
 
Dataset: badminton match [1 papers]
Bilinear Programming for Human Activity Recognition with Unknown MRF Graphs
 
Dataset: action-pose-estimation [1 papers]
Unconstrained Monocular 3D Human Pose Estimation by Action Detection and Cross-Modality Regression Forest
 
Dataset: large scale visual recognition [1 papers]
Dynamic Scene Classification: Learning Motion Descriptors with Slow Features Analysis
 
Dataset: tvhi [1 papers]
Bilinear Programming for Human Activity Recognition with Unknown MRF Graphs
 
Dataset: UT Kinect Action [1 papers]
Spatio-temporal Depth Cuboid Similarity Feature for Activity Recognition Using Depth Camera
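
A reverse index like the one above can be built mechanically from a paper-to-datasets mapping. The following is a minimal sketch, assuming a hypothetical `paper_datasets` dictionary that holds only three of the papers listed above; the function name `build_reverse_index` is my own and not part of any tool used for this post.

```python
from collections import defaultdict

# Hypothetical input: paper title -> datasets it reports results on.
# Entries are copied from the listing above; the mapping is illustrative only.
paper_datasets = {
    "Action Recognition by Hierarchical Sequence Summarization":
        ["natops", "armgesture"],
    "Recognize Human Activities from Partially Observed Videos":
        ["darpa y1", "darpa y2"],
    "HON4D: Histogram of Oriented 4D Normals for Activity Recognition from Depth Sequences":
        ["msr gesture3d", "msr action pairs"],
}


def build_reverse_index(paper_to_datasets):
    """Invert a paper -> datasets mapping into dataset -> sorted list of papers."""
    index = defaultdict(list)
    for paper, datasets in paper_to_datasets.items():
        for dataset in datasets:
            index[dataset].append(paper)
    return {dataset: sorted(papers) for dataset, papers in index.items()}


reverse_index = build_reverse_index(paper_datasets)
for dataset in sorted(reverse_index):
    papers = reverse_index[dataset]
    print("Dataset: %s [%d papers]" % (dataset, len(papers)))
    for paper in papers:
        print(paper)
```

Sorting both the dataset names and the paper titles keeps the listing stable between runs.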
 

Python wrapper for Quickshift

I’ve been playing around with a couple of superpixel algorithms lately: namely SLIC and Quickshift. In general I use Python for all of my vision work, so I wanted an interface for testing these on my data. Andreas Mueller has provided a SLIC Python wrapper and a wrapper for the CUDA implementation of Quickshift.

Unfortunately my computer doesn’t have an NVIDIA graphics card, so I was out of luck with the CUDA implementation. However, I was able to create a wrapper using the CPU-based VLFeat implementation.

The code is on github: https://github.com/colincsl/pyQuickshift

[Image: original image and its superpixels (using the average color of each region)]

–Requirements–

Python 2.x (tested with 2.7)
Numpy 1.x (tested with 1.7)
Cython
VLFeat (tested with 0.9.14)

–Optional requirement–

Image python module
IPython

–Compilation–

  1. Download VLFeat from http://www.vlfeat.org/
  2. Put the pyQuickshift folder in ../vlfeat/vl/
  3. Run the following in the pyQuickshift folder: python setup.py build_ext --inplace

–Example Usage–

# Run this in IPython with the command "ipython qtconsole --pylab" to show the output image

import Image
import numpy as np
import matplotlib.pyplot as plt
import pyQuickShift as qs

imgRaw = Image.open('/Users/colin/libs/vlfeat/data/a.jpg')
imgRGB = np.array(imgRaw, dtype=np.uint8)
img = np.ascontiguousarray(imgRGB, dtype=np.double)
if 0: # 3 channel
    labels, dists, density = qs.quickshift_3D(img*.5, 2, 20) # Image, kernel size, maxDistance
else: # 1 channel
    im2D = np.ascontiguousarray(img[:,:,2], dtype=np.double)
    labels, dists, density = qs.quickshift_2D(im2D*.5, 2, 20) # Image, kernel size, maxDistance
superpixels = np.unique(labels)
print("There are %d superpixels" % len(superpixels))
# Paint the superpixels with their average color
if len(superpixels) < 10000:
    imgColor = np.empty_like(imgRGB)
    for l in superpixels:
        imgColor[labels==l] = imgRGB[labels==l].mean(0)
    plt.imshow(imgColor)
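
The per-label loop above gets slow when there are many superpixels. A possible alternative, sketched here with plain NumPy and not part of pyQuickshift, computes every region's mean color in one pass with `np.bincount`. The helper name `paint_mean_color` and the toy inputs are assumptions for illustration.

```python
import numpy as np


def paint_mean_color(img_rgb, labels):
    """Return an image where each labeled region is filled with its mean color.

    img_rgb: (H, W, C) array; labels: (H, W) array of arbitrary region ids.
    """
    # Map arbitrary label values to consecutive ids 0..K-1
    _, flat_ids = np.unique(labels, return_inverse=True)
    flat_ids = flat_ids.ravel()
    counts = np.bincount(flat_ids)

    pixels = img_rgb.reshape(-1, img_rgb.shape[-1]).astype(np.float64)
    means = np.empty((counts.size, pixels.shape[1]))
    for c in range(pixels.shape[1]):
        # Sum each channel per region, then divide by the region size
        means[:, c] = np.bincount(flat_ids, weights=pixels[:, c]) / counts

    # Look up each pixel's region mean and restore the image shape
    return means[flat_ids].reshape(img_rgb.shape).astype(img_rgb.dtype)


# Toy example: a 2x2 image with two regions (ids 5 and 9)
toy_img = np.array([[[10, 0, 0], [30, 0, 0]],
                    [[0, 20, 0], [0, 40, 0]]], dtype=np.uint8)
toy_labels = np.array([[5, 5],
                       [9, 9]])
painted = paint_mean_color(toy_img, toy_labels)
```

With the real data this would be called as `paint_mean_color(imgRGB, labels)`. As in the loop version, casting the means back to the image dtype truncates fractional values.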

YAKS: Yet Another Kinect Setup Guide for Macs

There are a number of guides online [Kinectthesis][KBM][CB] that attempt to help you get started with the Xbox Kinect. Yet somehow it still took over a day’s worth of work to get it working properly on my computer running OS X 10.7 (Lion). In order to be complete, I’ll run through everything you need to get settled… but honestly other people have done most of the work here. The real additions are near the end. I’ll talk about how I finally got it working on my computer and mention some places to look to help you debug in case you have other problems.

Follow these steps exactly, in order:

  • Set up Xcode (comes on the OS X CD or through the App Store if you’re on Lion)
  • Install MacPorts
  • In terminal: sudo port install libtool
  • In terminal: sudo port install libusb-devel +universal
  • Download OpenNI here. Pick the unstable Mac OS X 10.6 build (it doesn’t matter if you’re on 10.7)
  • Download the Kinect Drivers. Click ‘Downloads’ near the top-right.
  • Download the PrimeSense NITE software. This is for skeleton tracking.
  • For convenience you can extract all three downloads into a single folder

e.g.:
/Colin/Code/Kinect
     –>/OpenNI
     –>/avin-sensor
     –>/NITE 

Install OpenNI:

  • cd into /OpenNI
  • In terminal: sudo ./install.sh

Install Kinect Sensor drivers:

  • cd into /avin-sensor/Bin
  • Extract SensorKinect-Bin-MacOSX-XXXX.tar.bz2
  • cd into this folder
  • In terminal: sudo ./install.sh

Install NITE (body tracking):

  • cd into /NITE
  • In terminal: sudo ./install.sh
  • Enter password: 0KOIk2JeIBYClPWVnMoRKn5cdY4= (it’s the same password for everyone)

XML file setup: You must set up a handful of config files so that the drivers know what to try to connect to and display.

Open the following files:

  • {/NITE/Data/} Sample-Scene.xml, Sample-User.xml, Sample-Tracking.xml
  • Scroll down on this site to get the appropriate XML files

Ok, now here is where things go differently than other tutorials. There were two problems on my computer:

  1. There is an additional XML file that needs to be changed
  2. There are some files that had to be copied from the avin-sensor package and placed in the OpenNI Samples directory.

(1) There is a configuration file for the OpenNI samples that is necessary to display video on screen. Replace the file …/OpenNI/Samples/Config/SamplesConfig.xml with the following XML. Take note of the LogLevel value. This makes it much easier to debug when things don’t work. Setting it to level “0” (Verbose) allowed me to figure out what was going wrong with my installation. I could see that there were problems with certain dynamic libraries not being loaded correctly. Set it to “3” if you don’t want to clog your terminal and you only want to see errors.

<!-- …/OpenNI/Samples/Config/SamplesConfig.xml -->
<OpenNI>
  <License vendor="PrimeSense" key="0KOIk2JeIBYClPWVnMoRKn5cdY4="/>
  <Log writeToConsole="true" writeToFile="false">
    <!-- 0 - Verbose, 1 - Info, 2 - Warning, 3 - Error (default) -->
    <LogLevel value="2"/>
    <Masks>
      <Mask name="ALL" on="true"/>
    </Masks>
    <Dumps> </Dumps>
  </Log>
  <ProductionNodes>
    <Node type="Depth" name="Depth1">
      <Configuration>
        <MapOutputMode xRes="640" yRes="480" FPS="30"/>
        <Mirror on="true"/>
      </Configuration>
    </Node>
    <Node type="Image" name="Image1" stopOnError="false">
      <Configuration>
        <MapOutputMode xRes="640" yRes="480" FPS="30"/>
        <Mirror on="true"/>
      </Configuration>
    </Node>
    <!-- <Node type="Audio" name="Audio1"/> -->
    <!-- <Node type="User" /> -->
  </ProductionNodes>
</OpenNI>
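
Switching the log level by hand gets tedious while debugging. Here is a small Python 3 sketch that does it programmatically, assuming the config is well-formed XML with plain straight quotes and a single `<LogLevel>` element under `<Log>`. The helper name `set_log_level` and the trimmed sample string are illustrative, not part of OpenNI.

```python
import xml.etree.ElementTree as ET

# Trimmed stand-in for SamplesConfig.xml (illustrative, not the full file)
SAMPLE_CONFIG = """<OpenNI>
  <Log writeToConsole="true" writeToFile="false">
    <LogLevel value="2"/>
  </Log>
</OpenNI>"""


def set_log_level(xml_text, level):
    """Return xml_text with <LogLevel value=...> set to level (0=Verbose ... 3=Error)."""
    if level not in (0, 1, 2, 3):
        raise ValueError("level must be 0, 1, 2, or 3")
    root = ET.fromstring(xml_text)
    node = root.find("./Log/LogLevel")
    if node is None:
        raise ValueError("no <LogLevel> element found under <Log>")
    node.set("value", str(level))
    return ET.tostring(root, encoding="unicode")


verbose_config = set_log_level(SAMPLE_CONFIG, 0)
print(verbose_config)
```

Note that ElementTree discards XML comments when parsing with its default settings, so the commented-out Audio and User nodes would be lost if you round-trip the real file this way. Keep a backup of the original.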

(2) The second problem was with the dynamic libraries NiViewer was trying to load. From the aforementioned “verbose” output I could see that libOpenNI.dylib and some other files could not be found. To fix this I copied them from /avin-sensor/Lib to /OpenNI/Samples/Bin/Release. Note that I also had to copy them to /NITE/Samples/Bin/Release to run those samples. There may be a more elegant solution to this. If/when I find it, I will update this post.
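
The copy step can be scripted if you end up reinstalling a few times. This is a minimal sketch, assuming every library you need ends in `.dylib`; the function name `copy_dylibs` and the paths in the example call are placeholders for wherever you extracted the packages.

```python
import glob
import os
import shutil


def copy_dylibs(src_dir, dst_dirs):
    """Copy every *.dylib in src_dir into each directory in dst_dirs.

    Returns the list of destination paths that were written.
    """
    copied = []
    for lib in sorted(glob.glob(os.path.join(src_dir, "*.dylib"))):
        for dst in dst_dirs:
            target = os.path.join(dst, os.path.basename(lib))
            shutil.copy2(lib, target)  # copy2 also preserves timestamps/permissions
            copied.append(target)
    return copied


# Example call (placeholder paths; adjust to your own folder layout):
# copy_dylibs("avin-sensor/Lib",
#             ["OpenNI/Samples/Bin/Release", "NITE/Samples/Bin/Release"])
```

The function only copies files, so the originals in the sensor package stay in place.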

Uninstalling

Unless your install went amazingly well and actually worked, you may need to uninstall OpenNI/Sensor/NITE at some point. This is pretty simple and was documented here. Again, for completeness I’ll reiterate here.

  • NITE: cd to the NITE folder (…/kinect/NITE) and in the terminal run: sudo ./uninstall.sh
  • Sensor: cd to the Mac folder within sensor (…/kinect/avin-sensor/bin/SensorKinect-Bin-MacOSX-vXXX/) and in the terminal run: sudo ./install.sh -u
  • OpenNI: cd to the OpenNI folder (…/kinect/OpenNI/) and in the terminal run: sudo ./install.sh -u

Other Kinect Resources:
  • Point Cloud Library (PCL): a library for working with and extracting information from point clouds/Kinect data. It’s maintained by people at Willow Garage and has methods for segmentation, detection, and other functionality.
  • OpenNI Arena: games/apps that people have written for the Kinect.


Resources for Self-guided Study in Computer Vision

Computer science has it easy. The web is littered with useful information from online forums to books to open courseware. It’s hard not to get overwhelmed by the wealth of knowledge. While there are numerous other postings around surveying all of the places to get open courseware or PDFs of important papers [Books], [Sites/courseware], [Papers], I thought I would share some of the resources I have previously used or am currently going through in the realm of computer vision and robotics.

Especially as an undergrad, it’s difficult to delve into an area if you don’t have the appropriate classes at your school. So take some of the following to expand your reach. It has been beneficial for me to start by looking through class lectures/presentations and then jump deeper into the math.

Introductory Resources:

There are a number of fairly different computer vision courses around. Some of them focus more on image processing methods while others are more machine learning oriented. The following courses give what I feel is a good, broad overview of the field. I think a ‘good’ course includes topics on camera models, low-level techniques like edge detection, texture, high-level object recognition, appropriate data structures (e.g. quadtrees), model fitting (e.g. RANSAC), and discussion of other spaces (e.g. Fourier space).

In many (most?) instances, especially in high-level vision, the mathematical models come down to applied machine learning. It’s important to at least have a grasp of the many different types of methods available. While you might not need to know the details of hidden Markov models for your interests in low-level segmentation, I think it’s important to have a broad background including many related topics.

CV Class: Intro to Computer Vision (Stanford; Prof Fei-Fei Li) Fairly standard CV course.

CV Class: Computer Vision (UIUC; Prof Forsyth) Fairly standard CV course.

ML Class: Practical Machine Learning (Berkeley; Prof Michael Jordan) This is where I first started to get interested in machine learning.

ML Notes: Statistical Data Mining Tutorials (Andrew Moore) Great resource for getting started with a variety of different ML techniques such as SVMs, Mixture Models, Hidden Markov Models, and many others.

ML Class: Pattern Recognition (SUNY Buffalo; Prof Jason Corso) Lecture note links in the calendar. These slides are full of text and are a good reference for both the math and intuition.

There are two ‘live’ courses going on through Stanford in AI and ML which appear to be pretty useful. I haven’t gone through the content but the syllabi look good.

CV Book: Computer Vision: Models, Learning, and Inference – This is a great (free!) preprint that leans heavily towards machine learning. Each section provides background on a set of models or machine learning tools involved, and methods of inference. The beginning is an in-depth overview of the necessary probability and machine learning concepts. I just started going through this book but it has been really useful for getting an overview of things like parts models and shape models.

CV Book: Computer Vision: Algorithms and Applications – This is a more traditionally laid-out textbook that is referenced in a number of current Intro to CV classes such as Fei-Fei Li’s above and the current CV course at my school (JHU).


Intermediate/Advanced Resources:

CV Class: Learning-based Methods in Vision (CMU; Prof Alexei Efros) I learned a lot about texture (texton) recognition and some state-of-the-art methods using fancy ML techniques.

CV Class: Grounding Object Recognition and Scene Understanding (MIT; Prof Antonio Torralba) This is an ongoing class focusing on higher-level vision. The first lecture looks promising, but I’m not exactly sure what the rest of the class will be like.

CV Papers is a collection of recent computer vision papers from the top/largest vision conferences. It includes CVPR, ECCV, ICCV, SIGGRAPH, and others. It lists all of the paper titles and links to PDFs and project files for a large number of them.

Video Lectures This site hosts thousands of academic videos including many in computer vision. For some conferences, like CVPR2010, they host a lot of videos for the talks. They also have a lot of ML videos for summer schools.

Tech Talks For some conferences, like ICML2011, they host video for most (all?) of the talks from the event. Others, like CVPR2011, only have selected videos. This is a great way to learn about a lot of recent work without solely relying on reading papers.


Play, push boundaries, fail, learn, repeat.

Preface: Most of the posts on this blog will be more technically oriented. However, to start I would like to provide some insight into my undergraduate years of college.

When I first applied to colleges in the Fall of 2006 I was set on majoring in video, film, or 3D animation. My passion back then was in creating videos (music videos, TV show packages, anything) with complex visual effects. I spent hours every day poring over my computer in Adobe After Effects, 3DS Max, or other video programs. At the time I thought it was a love of video itself; it wasn’t until later that I realized the obsession wasn’t that simple. Sure, videos can be fun and creative, but what really drew me in was the problem solving and creativity required to develop distinct visual effects. It wasn’t until I saw these same problems as part of a regional high school engineering competition that I decided to switch my anticipated major to engineering.

Why do I bring this up? Well, it could be because my recent switch from Mechanical Engineering to Computer Science mimics this prior situation. But more importantly I think it provides a good preface to my undergraduate years. It emphasizes my learning style: learn through play. As many people know, I keep myself very busy. Classes are important, but I am much more motivated by working on creative problems. In this post I will discuss my undergraduate years and the opportunities available at my alma mater.

At the University at Buffalo resources were plentiful: intellectually, socially, and culturally. On the academic side, there are numerous avenues for performing undergraduate research, working on exciting projects through engineering clubs, and gaining support for other creative activities through places like the Honors College. Socially, as long as you look for it, there are always things to do and people to meet. It’s hard not to find a group that you feel you have a lot in common with through the numerous clubs and meeting places. Complementary to this, at UB, and more generally in Buffalo, there are many opportunities to explore new cultural ideas. From shows downtown to events like International Fiesta on campus, there is often something going on that can expand your mind.

For now I would like to elaborate on the academic opportunities at UB. As many people will tell you, the adage “College is what you make of it” is true. It is totally possible to slide by in college — even in engineering — without getting much out of it. If you’re smart and don’t care much about grades or know-how then you can skim by. It’s my belief that what you do outside of your classes is what really defines you. The number of opportunities available at UB allows students to blaze past the oftentimes dull classwork. In particular I’m talking about research, clubs, and the various academic programs through the school of engineering.

Play to Learn

Throughout all of college I had some type of affiliation with the undergraduate Robotics Club. Our primary endeavour was to develop a vehicle for the Intelligent Ground Vehicle Competition. We also did a lot of supplementary activities like outreach and introductory projects, but that came second to IGVC. I was the president my Junior year and the secretary my Sophomore year.

In the four years of going to competition we always did fine, but never great. We were always able to qualify, which only about half of the teams are able to do, but we always placed in the teens (out of over 50 vehicles) in the autonomous parts of the contest. Ironically we did worse year after year for a couple of competitions. For anyone who hasn’t dealt with student engineering competitions, this isn’t too atypical. It often results from either trying to make too large of a change or from faulty hardware. Both of those down years we had severe electronics problems that almost stopped us from competing. This didn’t, however, stop us from learning.

The thing about competitions like this is that they require a good number of students dedicated to the contest. We had plenty of mechanical engineers but too few computer scientists. For a long time there were only two of us handling most of the software: one for system platform stuff, and me for automation. This meant that I got to explore a lot of different areas. Over the years I worked on implementing a handful of different algorithms for path planning, localization, and lane detection. There is a big difference between a small group that works intensely and performs OK and a large team that wins. In many cases the members of the smaller team end up learning more than the individuals on the larger team.

I sometimes hear that “Nobody remembers second place.” Sure, if all you care about is being on top then play to win. However, in my mind playing to learn is not only more satisfying but promotes a positive attitude and is more forward-looking. It is possible to win a robotics competition like this by largely relying on open source software already available. By implementing algorithms on your own you gain an understanding and intuition that you otherwise won’t obtain.

Shoot for the Stars

The principal reason for not pursuing the Robotics Club presidency Senior year was to give me time to both step back and throw myself into a new area. In the summer before Senior year I was unsure of what I actually wanted to go into for grad school. I wasn’t sure if I was going to predominantly apply to Mechanical Engineering or Computer Science programs. I had a strong interest in robotics, but what angle did I really want to take with it? Over that summer I became very interested in computer vision and machine learning. Did I want to continue that?

One of the best decisions I made Senior year was auditing Dr. Jason Corso’s Bayesian Vision graduate course (CSE573). Not only did I find the course invigorating with its coverage of topics like Markov Random Fields and MCMC, but it exposed me to new opportunities through his Visual and Perceptual Machines Lab. I fell in love with the math-heavy, probabilistic models used in this area of computer vision. While it was not ultimately accepted for IROS, I submitted a paper advancing a topic we discussed in this class for application on our IGVC vehicle in the robotics club. The project was fun and writing up the conference paper was a great learning experience. It further pushed me to pursue the realm of vision for grad school.

All mechanical engineers at UB are required to do a group senior design project in their final semester. Oftentimes this results in CAD design and analysis of some mechanical system. I had little interest in doing some industrial project with heat analysis or CAD design. By putting together a project proposal before the class started and going through the leg work I convinced the professor to allow me to work on the Solutions in Perception Challenge, a project that is much more computer science-oriented than mechanical engineering. I didn’t expect my proposal to be accepted, but because of my diligence and enthusiasm it was. I learned an incredible amount on that project, which put me a big step ahead in terms of CS experience.

Even in the face of skepticism, try your hardest to leap forward in order to reach your goals. People are nice. If you’re willing to push through with your ideas and show them what you are capable of then you will often be rewarded. Not only do you get to do more of what you want, you will learn more this way.

Opportunities and Failure

From my sophomore year I worked in the Automation, Robotics, and Mechatronics Lab under Dr. Venkat Krovi. In this capacity I gained interest in haptic interaction and robotic mechanisms. My activities here were largely independently driven. I was given supplies and funding to work on projects I found interesting, which expanded both my knowledge and skills. Had it not been for my experiences here I may not have chosen my path into medical robotics. I am happy to have worked in this lab and to have been surrounded by some great people.

In my mind, one of the great things about this independence was that there was room to fail. I am incredibly happy with some of my projects in this lab, but I wouldn’t say they were all successes. For example, I think my self-driven work on statistical methods for haptic skill transfer was slightly misguided. At that point I was greatly intrigued by machine learning and was attempting to incorporate that into my haptics work. The first part was fine: I was trying to replicate what some other researchers had published. However, given my lack of experience in that arena, afterwards I ended up pursuing ideas that didn’t make complete sense. My goal was to develop a technique for combining both position and velocity constraints for haptic guidance. I looked to take an expert’s input procedure, find the important positional and velocity characteristics, and fuse them in order to better teach a novice. Because some of the models I used were essentially black boxes, I wasn’t able to come to a conclusion on my hypothesis. I don’t want to detail all the things I would have done differently, but will say I should have taken smaller steps and discussed the ideas more in-depth with people experienced in the different sub-areas.

Failure is important. Without failing it is hard to know where you stand and what you are capable of. The important part is learning from the experience and figuring out how you could have done things differently.

Final Remarks

My undergraduate years were great. There were times when I felt overworked and frustrated. But that’s a good thing. Hard work pays off. I learned a lot, had a great experience, and still have a lot of great friends from Buffalo. I started activities like rock climbing which I have grown to love. While there were certainly downsides to attending UB, like the traditional engineering curriculum, the research-first attitude of some professors, and harsh restrictions on which classes you need in mechanical engineering, I will try not to dwell on them. I will remember what I learned, and did best:

Play, push boundaries, fail, learn, and repeat.