Thursday, 29 April 2021

GLM single trial weighting for EEG

Weighting single trials in EEG for better precision 

It has been many years since I first tried to implement single trial weighted least squares in the LIMO tools. The core idea is to have weights that capture trial dynamics - that part was developed more than 7 years ago. Armed with this algorithm, we can just plug it into a GLM and that's it: more precision in the parameter estimates (i.e. they reflect the truth better) and possibly better group level results. The preprint is finally out! Let me tell you why I think this is a useful approach, and what made it so hard to implement.

Single Trial Weighting

Before looking at how we compute this, let's think of a standard weighting scheme: we look at how data points are distributed to obtain weights, say using the inverse of the variance, and we compute a weighted mean, or weighted parameter estimates in a GLM. The idea is simple: the farther a point is from the bulk, the less weight it gets.
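As a toy illustration of that principle (this is my own sketch, not the LIMO code; the softening constant 0.1 is arbitrary), deviation-based weights pull the estimate back towards the bulk:

```matlab
% Illustrative sketch: deviation-based weights in the spirit of
% inverse-variance weighting - farther from the bulk, less weight.
trials = [4.2 4.8 5.1 5.3 4.9 5.0 5.2 4.7 5.4 4.6 15]'; % one outlying trial
dev    = abs(trials - median(trials));   % distance to the bulk
w      = 1 ./ (dev + 0.1).^2;            % the farther away, the less weight
w      = w ./ sum(w);                    % normalize weights to sum to 1
wavg   = sum(w .* trials);               % weighted mean, close to 5
avg    = mean(trials);                   % ordinary mean, pulled up by the outlier
```

The weighted mean sits much closer to the bulk (around 5) than the ordinary mean, which the single outlying trial drags towards 5.8.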

What happens if you apply such naive weighting to time courses? Let's look at a trial slightly out-of-phase (figure 1). At each time point, we look at deviations and apply weights. While the result is a better approximation of the truth, time points are treated as an alternation of 'good' then 'bad' data. Physiologically, this is weird - neurons don't compute things independently at each time point; this is clearly an auto-regressive process, and we just cannot ignore that.

Figure 1. Example of a trial out-of-phase (green) compared to the bulk (dark grey). If weighting is computed independently at each time point, we observe an alternation of high and low weights - which would indicate that neural activity is sometimes different and sometimes the same as for other trials.
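The flickering can be reproduced in a few lines (a made-up demo, with arbitrary noise levels and an arbitrary softening constant, not the LIMO implementation): simulate in-phase trials plus one phase-shifted trial, and weight the shifted trial point by point.

```matlab
% Why point-by-point weighting 'flickers' for an out-of-phase trial
t      = linspace(0, 2*pi, 100);
trials = repmat(sin(t), 19, 1) + 0.1*randn(19, 100);  % 19 in-phase trials
odd    = sin(t + pi/4);                               % 1 out-of-phase trial
data   = [trials; odd];
dev    = abs(data - median(data));     % deviation to the bulk, per time point
w      = 1 ./ (dev(end,:) + 0.01);     % naive weights for the odd trial
% where the shifted sine crosses the bulk, the odd trial looks 'good'
% (high weight); in between it looks 'bad' (low weight)
```

The weight vector `w` oscillates by a large factor along the trial, exactly the implausible 'good/bad' alternation described in the caption.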

Our approach assigns weights based on the whole trial temporal profile, and how it differs from the bulk. For that we use the Principal Component Projection method developed by Filzmoser et al. (2008). The results are means or parameter estimates that approximate the truth just as well as the naive approach, but with a scheme that is biologically meaningful. If the result is the same, who cares? Well, it looks the same, but isn't - check the preprint: the type of noise, for instance, has a different impact, and this matters.
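To make the contrast with the naive scheme concrete, here is a hedged sketch of the idea (NOT the actual Filzmoser et al. PCP algorithm, just a simple distance-based stand-in): derive ONE weight per trial from its whole temporal profile, so an out-of-phase trial is down-weighted everywhere rather than time point by time point.

```matlab
% One weight per trial, from the whole temporal profile
t    = linspace(0, 2*pi, 100);
data = [repmat(sin(t), 19, 1) + 0.1*randn(19, 100); sin(t + pi/4)];
d    = sqrt(sum((data - median(data)).^2, 2));  % one distance per trial
w    = 1 ./ (1 + d.^2);                         % decreasing function of distance
w    = w ./ sum(w);                             % normalize
% the out-of-phase trial (last row) gets the smallest weight overall
```

The shifted trial is far from the median profile along the whole epoch, so it receives the single smallest weight, applied uniformly across time.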

Just use weights in a GLM, it's easy they said ...

You might think that once you have weights you just plug them into your GLM and compute Weighted Least Squares. This is what I did, and it worked - parameter estimates were less influenced by outlying trials. So why 7 years? Because while estimation was better, I just couldn't get the inference to work. You might argue that 1st level/subject inference isn't that interesting, so who cares. I care. It isn't satisfying to have half-baked solutions. It took 7 years because life takes over and you get busy with other things when your ideas keep failing; but I kept at it, occasionally testing something different.

The difficulty for inference is that LIMO tools, like other software, use a massive univariate approach, i.e. they consider each data point independently, then apply a multiple comparisons correction based on clustering (cluster mass and tfce) - but the weighting does not treat data points independently! That led to crazy false positive rates, until I found the right way to resample the data. It turns out you cannot center the data to build the null; you just break the link between X and Y. You should also reuse the weights rather than compute new ones on each resample ... and of course there are a few other tricks, like adjusting the mean squared error: since the hyperplane no longer overfits, it must be below the ordinary least squares one - just to make things more interesting.
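The resampling recipe above can be sketched in a toy, single-channel version (my own illustration, not LIMO's implementation; the design, effect sizes and fake weights are invented): keep the design X and the observed weights W fixed, and bootstrap trials of Y against them, which breaks the X-Y link and gives the null.

```matlab
% Toy bootstrap-under-the-null for weighted least squares
n = 40;
X = [kron([1;0],ones(n/2,1)) kron([0;1],ones(n/2,1))]; % 2-condition design
Y = X*[2;1] + randn(n,1);              % fake single-channel trial data
W = diag(rand(n,1)*0.5 + 0.5);         % pretend these are the trial weights
beta_obs = (X'*W*X) \ (X'*W*Y);        % observed WLS estimates

Nboot = 500; beta_null = zeros(2, Nboot);
for b = 1:Nboot
    idx = randi(n, n, 1);              % resample trials with replacement
    Yb  = Y(idx);                      % Y resampled; X and W REUSED as observed
    beta_null(:,b) = (X'*W*X) \ (X'*W*Yb); % X-Y link broken -> no-effect null
end
```

Because the link between X and Y is broken while the data themselves are untouched, the null distribution of the condition difference is centred on zero even though a real effect is present in `beta_obs`.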

And so it works

Having solved that subject-level inference issue, we can move on, pool those parameter estimates, and compute group level statistics. At first glance, great: more statistical power, more effects detected (figure 2) - thanks to more precision, which can reduce between-subject variance.

Figure 2. Group results from the Wakeman and Henson (2015) face experiment using 1st level OLS parameter estimates (top) vs. 1st level WLS parameter estimates (bottom), illustrating the discrepancy of more statistical power (left) despite smaller effect sizes (right).

Digging deeper, you'll see it's more complicated, because overall effects are weaker (check the scales of those maps). This is not totally surprising if you consider that you got rid of the most influential trials (although I agree this could have gone the other direction). Weaker effects but more statistical power? Well, the relative size of real effects compared to space/time areas where nothing happens is larger, i.e. the distributions are right skewed. Since resampling under the null gives very similar cluster masses and tfce maxima across methods, and right-skewed maps cluster more, we end up with more statistical power at the group level when using subject-level weighted least squares than standard ordinary least squares.

I can see other applications beyond the GLM for EEG: basically we have weights for profiles (temporal/spectral) acting as variance stabilizers, and there must be a few applications out there that can make use of that.

As always, nothing is simple in research - but isn't that why we love it?

Tuesday, 5 January 2021

MEEG single trial metrics

Measuring the quality of MEEG single trials

What's the goal?

In a typical MEEG data acquisition session, many trials for a given experimental condition are collected. We know there are variations from trial to trial; some of those variations are measurement errors, others are physiological in nature and can sometimes be relevant, underlying behavioural variability. There is, therefore, some interest in coming up with metrics for each trial.

Time-series metrics

What metrics can be of interest for time-series, or spectra for that matter? Importantly, I want those metrics to be independent of specific features in the data. That is, we could compute various ratios and variations around/between/among features (e.g. peaks), but this requires defining features. So what's left:

- temporal SNR is a classic: the standard deviation over time, telling us how much variation we have within a trial

- total power gives another picture: how much energy a given trial generated; it is related to tSNR

- autocorrelation indicates the smoothness of the data

There is one more that does depend on a reference. Sometimes we want to know how much a trial differs from the 'others'; but this 'other' cannot be the mean, because the mean is computed using the trial of interest. If one provides a reference mean, then we can compute the amplitude deviation:

- amplitude variation, that's the standard deviation from a reference trial
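The four metrics above are simple to compute; here is a minimal sketch (variable names and the lag-1 smoothness index are my own choices, not a LIMO API), with each trial as one row:

```matlab
% Per-trial metrics for a trials-by-time matrix
trials = randn(30, 200);              % 30 trials x 200 time points (fake data)
ref    = mean(trials);                % a reference mean (e.g. from other trials)
tSNR   = std(trials, 0, 2);           % temporal SNR: std over time, per trial
power  = sum(trials.^2, 2);           % total power per trial
ac1    = zeros(30, 1);                % lag-1 autocorrelation as smoothness index
for k = 1:30
    c      = corrcoef(trials(k,1:end-1), trials(k,2:end));
    ac1(k) = c(1,2);
end
ampdev = std(trials - ref, 0, 2);     % amplitude deviation from the reference
```

Each metric yields one value per trial, so trials can then be compared or weighted on any of them.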

What does that look like?

Here is an example from 18 subjects; I want to compare the trials underlying the black and red mean ERPs (from Wakeman and Henson, 2015). FYI, those were defined as being different in multivariate space, but here I want to know what this means back in the standard time domain.

Figure 1. Means of 'black' and 'red' trials

Temporal SNR

Comparing the red and black trials subject-wise (bootstrapped 20% trimmed mean differences) shows that 12/18 subjects plotted above (subjects 2, 3, 4, 5, 6, 7, 8, 11, 13, 14, 16 and 17) have more variable red trials.

Taking the mean values across all black vs. red trials, the group comparison for figure 1 above shows a significant difference. If the mean is taken not just across all trials but also across all channels, the effect is smaller and not significant.

Figure 2. tSNR for channels showing 'maximum differences' (see above) and averages across all channels 

Total Power

Comparing the red and black trials subject-wise (bootstrapped 20% trimmed mean differences) shows that those same 12/18 subjects have an effect, i.e. since red trials are more variable, they also carry more power.

Taking again the mean values across all black vs. red trials, the group comparison for figure 1 above shows a significant difference. If the mean is taken not just across all trials but also across all channels, the effect is smaller and not significant.

Figure 3. Power for channels showing 'maximum differences' (see above) and averages across all channels 


Autocorrelation

Comparing the red and black trials subject-wise (bootstrapped 20% trimmed mean differences) shows that 12/18 subjects have an effect, i.e. red trials are less autocorrelated and thus less smooth - 9 subjects are the same as before (3, 4, 5, 7, 8, 11, 13, 16 and 17) and 3 uniquely show this effect (9, 12, 18).

As before, the group comparison of mean values across all black vs. red trials in figure 1 above shows a significant difference. If the mean is taken not just across all trials but also across all channels, the effect is smaller and not significant.

Figure 4. Autocorrelation for channels showing 'maximum differences' (see above) and averages across all channels

This result is interesting because, looking at the mean time courses in figure 1, one could expect a high-frequency peak for red trials - but that doesn't seem to be the case (figure 5), because most often those oscillations are not at a given frequency. Yet smoothness is captured here, showing differences.

Figure 5. Mean Power spectrum of trials used to generate mean time course in figure 1.

Amplitude variations

Since those red and black averages/trials come from a channel-wise multivariate 'distance', it is maybe unsurprising that the mean amplitude variations of red trials compared to the mean of black trials are significant for every subject at the group level, both for the max channels and averaged across all channels. This last result is interesting to me, because the channel-wise distance seems to capture trials with different dynamics (tSNR/smoothness), leading to full-scalp differences in amplitudes; and apparently it's ok to use averages for ERPs ... maybe not, but that's another story.

Figure 6. Amplitude variations for channels showing 'maximum differences' (see above) and averages across all channels.


tSNR and power capture the same information, albeit with different interpretations, and autocorrelation gives us an idea of smoothness, which is somewhat the inverse of tSNR but goes beyond it - maybe the best metric to capture how noisy the data are. Finally, amplitude variation confirms differences of any trial to a reference, but I'm not sure it's that useful - I'd happily take comments on all that.

Friday, 7 June 2019

Bootstrapping and multiple comparisons correction in EEG

A Matlab tutorial for bootstrapping with an introduction to multiple comparisons correction in EEG

A literate script to demonstrate how the bootstrap works and how we can use it for multiple comparisons correction in EEG. Some of the code relies on LIMO EEG. A large part was inspired by G. Rousselet's tutorial on the bootstrap (in R). The code below (and data) can be found here.

The bootstrap is a well-established early computer-age inferential method (Efron, 1979; Efron & Hastie, 2016; Efron & Tibshirani, 1994). The bootstrap is based on the idea that using only the data at hand can sometimes give better results than making unwarranted assumptions about the populations we’re trying to estimate. The core mechanism of the bootstrap is sampling with replacement from the data, which is a form of data-driven simulation.

The bootstrap idea

The idea is to make inferences using the data at hand only, avoiding assumptions about the underlying distribution the observations come from, by sampling with replacement from the data.

% imagine we do an experiment, and have 60 observations
clear variables
N = 60; obs = randn(N,1);

% We can now compute the mean
avg_obs = mean(obs);

The theory tells us that, for normally distributed data, the standardized sample mean follows a Student t-distribution with N-1 degrees of freedom. A confidence interval (CI) for the mean is an interval that covers the population mean with an error rate alpha, here 5% for a 95% CI (Morey et al., 2016).

alpha_value = 5/100;
Tcritic     = tinv(alpha_value/2,N-1); % divide by two 2.5% each side of the distribution
se          = std(obs)/sqrt(N); % standard error
CI          = [avg_obs-abs(Tcritic)*se avg_obs+abs(Tcritic)*se]; % same as [~,~,CI]=ttest(obs)

Let's illustrate what it means

figure('units','normalized','outerposition',[0 0 1 1])
subplot(1,2,1); scatter([0.5:1/(N-1):1.5],obs,30,'filled');
box on; grid on; axis tight; xlabel('observations'); ylabel('observed values')
hold on; rectangle('Position',[0.75 CI(1) 0.5 diff(CI)],'Curvature',[0.4 0.4],'LineWidth',2,...
    'FaceColor',[0.35 0.35 0.35],'EdgeColor',[0.35 0.35 0.35]);
title(sprintf('Observed data and 95%% CI of the mean [%g %g]',CI(1),CI(2)))

sample = [-5:.01:5]+avg_obs;
theoretical_distribution = tpdf(sample,N-1);
subplot(1,2,2); plot(sample,theoretical_distribution,'LineWidth',3) % right-hand panel
box on; grid on; axis tight; xlabel('mean values'); ylabel('relative frequency')
title(sprintf('Student t(%g,%g) distribution',avg_obs,N-1))
hold on; [~,position] = min(abs(theoretical_distribution-alpha_value));
text(-4.3,0.06,['T critic = ' num2str(Tcritic)]);
text(2.2,0.06,['T critic = ' num2str(abs(Tcritic))]);

We can compute the same thing without assuming a Student t-distribution, by simply computing the distribution of sample means as if we were collecting data from many experiments - i.e. by bootstrapping.

Nboot     = 1000;
resamples = randi(N,N,Nboot); % pick at random out of the N observations
avg_boot  = sort(mean(obs(resamples))); % the Nboot means from sampling
CI_boot   = [avg_boot(alpha_value/2*Nboot) avg_boot(Nboot-alpha_value/2*Nboot)];

% Let's compare
figure('units','normalized','outerposition',[0 0 1 1])
subplot(1,2,1); plot(sample+mean(avg_boot),theoretical_distribution,'LineWidth',3)
box on; grid on; axis tight; xlabel('mean values'); ylabel('relative frequency')
hold on; plot(repmat(sample(position)+mean(avg_boot),1,6),[0:.01:alpha_value],'r','LineWidth',2);
title(sprintf('Student t(%g,%g) distribution',avg_obs,N-1))
subplot(1,2,2); h = histogram(avg_boot, 100);
title('density of mean values'); xlabel('mean values'); ylabel('frequency')
hold on; bar(CI(1),5,h.BinWidth,'r');
axis tight; grid on; box on

Let's have a quick look at different distributions - and since this is so simple, let's get both the mean and the median confidence intervals.

figure('units','normalized','outerposition',[0 0 1 1]);
resample = randi(100,100,1000);
for d=1:3
    if d == 1
        % normal data as before - does this really exist?
        data = sort(randn(100,1))+80;
    elseif d == 2
        % ex-Gaussian (normal + exponential) - classic reaction times
        No = normrnd((200+randn(1)*50),25*abs(randn(1)),[100,1]);
        Ex = exprnd(75,[100,1]) + (150+randn(1)*50);
        data = (No+Ex)/2;
    else
        % uniform
        unif = sort(rand(200,1)).*150;
        data = unif(end-99:end);
        data = data.*100/max(data); % pretend max is 100
    end
    boot = sort(mean(data(resample)));
    boot2 = sort(median(data(resample)));
    subplot(1,3,d); histogram(data,50); [~,~,tCI]=ttest(data);
    title(sprintf('mean %g, 95%% CI [%g %g] \n boot CI [%g %g] \n median %g boot CI [%g %g]',...
        mean(data), tCI(1), tCI(2), boot(25), boot(975), median(data), boot2(25), boot2(975)))
end


Bootstrap does not bring robustness

In statistics, the term robust means resistant to outliers. The mean is the least robust location estimator, with a breakdown point of 0%, while the median is the most robust, with a breakdown point of 50% (i.e. it doesn't change with up to 50% of outliers).
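A two-line check of those breakdown points (a toy example of mine, before the constrained bootstrap below):

```matlab
% Breakdown points in action: corrupt 10% of the data
A = 1:10;              % clean data: mean = median = 5.5
B = A; B(10) = 1000;   % replace one observation with an outlier
% the mean explodes (104.5) while the median does not move (5.5)
[mean(A) mean(B); median(A) median(B)]
```

With a single corrupted value out of ten, the mean is already useless, while the median is untouched; it would take corrupting half the sample to move the median arbitrarily.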

A = [1:9 50]; Nboot = 1000;
% rather than completely random, the resampling can be constrained, here to
% at least 4 unique values, which ensures some spread in each bootstrap sample
bootsamples = zeros(10,Nboot);
index = 1;
while index<=Nboot
    tmp = randi(10,10,1);
    if length(unique(tmp))>4
        bootsamples(:,index) = tmp;
        index = index+1;
    end
end

means = sort(mean(A(bootsamples)));
medians = sort(median(A(bootsamples)));
low = floor(alpha_value/2*Nboot); high = Nboot-low;
figure('units','normalized','outerposition',[0 0 1 1])
subplot(1,2,1); histogram(means, 100); axis tight; grid on; box on
title(['Bootstrapped mean values ' num2str(mean(A)) ' 95% CI [' ...
    num2str(means(low)) ' ' num2str(means(high)) ']'])
subplot(1,2,2); histogram(medians, 100); axis tight; grid on; box on
title(['Bootstrapped median values ' num2str(median(A)) ' 95% CI [' ...
    num2str(medians(low)) ' ' num2str(medians(high)) ']'])

Bootstrap is not assumption free

When we sample with replacement, we sample from the observed data only.

A = [1 1 1 2 2 2 3 3 3 6 9 12 15 15 15 16 16 16 17 17 17];
resamples = randi(length(A),length(A),Nboot); % pick at random out of the N observations

We could, however, imagine that other values exist in the population and were simply not observed in the current sample! Using such priors is called the Bayesian bootstrap (Rubin, 1981).

n = length(A);
Bayes_resamples = NaN(size(resamples));
for boot=1:Nboot % bootstrap loop
    theta = exprnd(1,[n,1]);
    weights = theta ./ repmat(sum(theta),n,1);
    Bayes_resamples(:,boot) = datasample(A,n,'Replace',true,'Weights',weights);
end

% let's check CI
[~,~,tCI] = ttest(A); % from observed data, assume normality
tmp       = sort(mean(A(resamples))); % build your distribution of means
bootCI    = [tmp(low) tmp(high)]; % take 2.5% 97.5% of this distribution
tmp       = sort(mean(A(Bayes_resamples))); % distribution of Bayesian bootstrap means
ci        = 1:low; % candidate lower bounds
ciWidth   = tmp(ci+high) - tmp(ci); % width of all intervals covering the target mass
[~,index] = find(ciWidth == min(ciWidth),1); % densest interval
HDI       = [tmp(index) tmp(index+high)]; % highest density interval

% let's see how different this is
figure('units','normalized','outerposition',[0 0 1 1])
subplot(1,3,1); histogram(A);
title(['observed data: 95% CI =[' num2str(tCI(1)) ' ' num2str(tCI(2)) ']'])
subplot(1,3,2); all = A(resamples); histogram(all(:));
title(['bootstrapped data: 95% CI =[' num2str(bootCI(1)) ' ' num2str(bootCI(2)) ']'])
subplot(1,3,3); all = A(Bayes_resamples); histogram(all(:));
title(['Bayes bootstrapped data: 95% HDI =[' num2str(HDI(1)) ' ' num2str(HDI(2)) ']'])

Application to ERP

electrode = 60;

% set the proper time in msec
load('cond1-2_single_subjects_Mean'); % loading data from LIMO central tendency outputs
timevect    =;  
% compute Mean of two conditions
[est,HDI]   = limo_central_estimator(squeeze(,:,1,:)),'Mean',95/100);
[est2,HDI2] = limo_central_estimator(squeeze(,:,2,:)),'Mean',95/100);

% plot
figure('units','normalized','outerposition',[0 0 1 1]); subplot(2,1,1)
plot(timevect,est); hold on; plot(timevect,est2); grid on
fillhandle = patch([timevect fliplr(timevect)], [HDI(1,:),fliplr(HDI(2,:))], [0 0 1]);
set(fillhandle,'EdgeColor',[0 0 1],'FaceAlpha',0.2,'EdgeAlpha',0.8);%set edge color
fillhandle = patch([timevect fliplr(timevect)], [HDI2(1,:),fliplr(HDI2(2,:))], [1 0 0]);
set(fillhandle,'EdgeColor',[1 0 0],'FaceAlpha',0.2,'EdgeAlpha',0.8);%set edge color
title('Means and 95% HDI')

% same using trimmed means
[est,HDI]   = limo_central_estimator(squeeze(,:,1,:)),'Trimmed mean',95/100);
[est2,HDI2] = limo_central_estimator(squeeze(,:,2,:)),'Trimmed mean',95/100);

subplot(2,1,2); plot(timevect,est); hold on; plot(timevect,est2); grid on
fillhandle = patch([timevect fliplr(timevect)], [HDI(1,:),fliplr(HDI(2,:))], [0 0 1]);
set(fillhandle,'EdgeColor',[0 0 1],'FaceAlpha',0.2,'EdgeAlpha',0.8);%set edge color
fillhandle = patch([timevect fliplr(timevect)], [HDI2(1,:),fliplr(HDI2(2,:))], [1 0 0]);
set(fillhandle,'EdgeColor',[1 0 0],'FaceAlpha',0.2,'EdgeAlpha',0.8);%set edge color
title('20% Trimmed Means and 95% HDI')

The beauty of using the HDI is that 1 - it is a direct probability statement about the mean (not an interval that merely covers the population mean at rate 1-alpha over repeated experiments) and 2 - since this is Bayesian, you can accept the null (rather than only fail to reject it).

Estimating H0 using bootstrap

In the same way as we computed the distribution of mean values, we can do it for the null - for instance, a one-sample t-test testing whether the trimmed mean of the data is 0.

% let's get the trimmed mean again with 95% CI
data = squeeze(,:,1,:));
[t,tmdata,trimci,se,p,tcrit,df]=limo_trimci(data, 20/100, 5/100, 0);

% step 1: make null data (the trimmed mean = 0 --> remove TM from each subject)
null = data - repmat(tmdata,[1 1 size(data,3)]);

figure('units','normalized','outerposition',[0 0 1 1]);
subplot(1,2,1); plot(timevect,squeeze(data(electrode,:,:)));
grid on; hold on; plot(timevect,squeeze(tmdata(electrode,:)),'k','LineWidth',2);
title('observed data and trimmed mean');
subplot(1,2,2); plot(timevect,squeeze(null(electrode,:,:)));
grid on; hold on; [~,tmnull]=limo_trimci(null, 20/100, 5/100, 0);
plot(timevect,squeeze(tmnull(electrode,:)),'k','LineWidth',2);
title('null data and trimmed mean');

% step 2: bootstrap as usual
TM_null = NaN(Nboot,length(timevect)); % trimmed means under the null (percentile bootstrap)
T_null  = NaN(Nboot,length(timevect)); % t values under the null (percentile-t bootstrap)
boot_table = limo_create_boot_table(data,Nboot);
parfor b=1:Nboot
    [tnull,tmnull] = limo_trimci(null(:,:,boot_table{electrode}(:,b)), 20/100, 5/100, 0);
    TM_null(b,:)   = tmnull(electrode,:);
    T_null(b,:)    = tnull(electrode,:);
end

% step 3: get p-values
% p-values are the prob under the null to observe a value equal to or bigger
% than the one observed; we thus compute how many times the observed data are
% bigger than the null
pboot  = mean(squeeze(tmdata(electrode,:,:))>TM_null);
pboot  = min([pboot' 1-pboot'],[],2)';
ptboot = mean(squeeze(t(electrode,:,:))>T_null);
ptboot = min([ptboot' 1-ptboot'],[],2)';

% let's make a figure
figure('units','normalized','outerposition',[0 0 1 1]);
plot(timevect,squeeze(tmdata(electrode,:))); hold on; grid on
fillhandle = patch([timevect fliplr(timevect)], ...
    [squeeze(trimci(electrode,:,1)),fliplr(squeeze(trimci(electrode,:,2)))], [0 0 1]);
set(fillhandle,'EdgeColor',[0 0 1],'FaceAlpha',0.2,'EdgeAlpha',0.8); % set edge color
plot(timevect,(squeeze(p(electrode,:))<0.05)-11,'rx');  % sig points rise into view, others are clipped
plot(timevect,(pboot<0.05)-10.7,'go');                  % percentile bootstrap
plot(timevect,(ptboot<0.05)-10.4,'k*');                 % percentile-t bootstrap
axis([timevect(1) timevect(end) -10 6])
title('Trimmed mean with sig values based on t distrib (red) or percentile boot (green) or percentile t-boot (black)')

Multiple comparisons correction

The issue we are facing now is that we have done 501 t-tests (one per time point), so we can expect ~25 false positives; we just don't know which ones.

% let's check if that's true
h = ttest(randn(18,length(timevect)));
fprintf('found %g false positives out of %g expected \n',sum(h),5/100*length(timevect))
found 20 false positives out of 25.05 expected

In EEG we talk about family-wise error (FWE) correction because tests are correlated with each other. The FWER = 1 - (1 - alpha)^n, so for 501 tests we reach 100%. Since the FWER is the probability to make at least one error, let's compute the distribution of maximum t value -- if the biggest effect is controlled then smaller effects are controlled too. 

for simulation = 1:100
    h(simulation,:) = ttest(randn(18,length(timevect)));
end
E = mean(h); % on average, how often a false positive occurs per variable

figure('units','normalized','outerposition',[0 0 1 1]);
subplot(1,2,1); plot(E.*100,'LineWidth',3); grid on; box;
title(['average type 1 per variable: ' num2str(mean(E))])
subplot(1,2,2); plot(cumsum(E).*100,'LineWidth',3); grid on; box;
for n=1:length(timevect); FWER(n) = 1 - (1 - 0.05)^n; end
hold on;  plot(FWER.*100,'LineWidth',3); axis([0 100 1 110])
title('FWER cumulative vs predicted');

Let's see what happens to our ERP using maximum statistics.

Tmax_null = sort(max(T_null,[],2)); % this time we record maximum t-values
high      = round((1-alpha_value).*Nboot);
threshold = Tmax_null(high); % that's the value we need to go over for
                             % the largest effect anywhere to be significant
figure; histogram(Tmax_null); hold on; grid on
title(['Distribution of maxima 95th percentile = ' num2str(threshold)])

figure('units','normalized','outerposition',[0 0 1 1]);
plot(timevect,squeeze(tmdata(electrode,:))); hold on; grid on
fillhandle = patch([timevect fliplr(timevect)], ...
    [squeeze(trimci(electrode,:,1)),fliplr(squeeze(trimci(electrode,:,2)))], [0 0 1]);
set(fillhandle,'EdgeColor',[0 0 1],'FaceAlpha',0.2,'EdgeAlpha',0.8); % set edge color
plot(timevect,(ptboot<0.05)-10.7,'rx');                        % uncorrected percentile t-boot
plot(timevect,(squeeze(t(electrode,:))>threshold)-10.4,'k*');  % corrected with max statistics
axis([timevect(1) timevect(end) -10 6])
title('Trimmed mean with sig values based on percentile t-boot (red) with MCC (black)')

So far, we have considered each time point as independent and did not even consider space (all electrodes). Dedicated methods such as cluster-mass and tfce have been developed to do just that; see Pernet et al. (2015) for an overview of these more powerful approaches. In short, instead of looking at the maximum over all data points, electrodes and time points are clustered and we look at the maximum of cluster statistics under the null.
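The cluster-mass idea can be sketched in 1-D (a toy version of mine, not the LIMO implementation, with made-up t-values): threshold the t-values, sum t within each contiguous suprathreshold run, and compare those masses to the null distribution of the MAXIMUM cluster mass.

```matlab
% Toy 1-D cluster-mass computation over a time course
tvals  = [0.5 2.3 2.8 3.1 0.4 0.2 2.2 0.1]; % fake t-values over time
thresh = 2;                                 % cluster-forming threshold
above  = tvals > thresh;
masses = []; current = 0;
for i = 1:numel(above)                      % walk the time axis
    if above(i)
        current = current + tvals(i);       % accumulate mass within a cluster
    elseif current > 0
        masses(end+1) = current;            % close the cluster
        current = 0;
    end
end
if current > 0, masses(end+1) = current; end
% masses = [8.2 2.2]: two clusters; each observed mass would then be
% compared to the bootstrap distribution of the maximum cluster mass
```

Because only the maximum cluster mass is recorded under the null, the correction controls the family-wise error while exploiting the fact that real effects are extended in time (and space).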

Wednesday, 30 January 2019

Assessing anatomical brain parcellation


Anatomical brain parcellation is used in three scenarios: (1) to identify a brain region in which to measure some level of activity, for instance changes in BOLD or tracer concentration; (2) to determine the anatomical properties of brain regions, for instance volume or thickness; (3) to use parcels to compute connectivity values, for instance rsfMRI correlations or the average number of streamlines in DTI. Given these applications, making sure the right regions and borders are obtained is essential. Dr Shadia Mikhael (who graduated in September) looked into this under my supervision, and she did well :-)

Where are we? (as of 2019)

Part 1: protocols

In Mikhael et al. (2017) we reviewed in detail how anatomical parcels are made, that is, how borders are defined and on the basis of which atlases/templates. Surprisingly, the main contenders, FreeSurfer, BrainSuite and BrainVISA, all have some serious issues.

There is little detail on the reference populations used: sex ratio, ethnicity, age, education, handedness, etc. In general, software relies on small numbers of participants, sometimes averaging across ages and ethnicities. Perhaps related to this point, accounting for anatomical variability is not always explicit. In all software, the gyral parcels, or gyral volumes, are regions bound by the grey matter, or pial, layer on the exterior and the WM layer on the interior (plus sulcus-derived regions in BrainVISA). The issue, however, is that some regions are defined relative to each other. Practically, this means that after landmarks are defined and some regions found, you can find the medial border of region A defined by region B and then the lateral border of region B defined by region A - so where exactly is the border? This causes big problems when accounting for anatomical variability such as branching, interruptions and even double gyri.

sub-01 from
Despite a clear double cingulate, software packages can fail to capture it all, with a knock-on effect on neighbouring parcels (let's hope such variations are not more frequent in some patient populations, or all our analyses are wrong).

Part 2: software assessment

To assess whether those differences in protocol matter, we created: i) a protocol (for 4 regions) that identifies borders from landmarks only and explicitly accounts for variability; ii) a dataset of manually segmented regions using this protocol; iii) software (Mikhael and Gray, 2018) to compute volume, thickness and surface area from binary images; and iv) some basic comparisons against FreeSurfer, since some of its measures were validated against histology (Mikhael et al., 2019).

Some interesting insights came out of this:
- creating a protocol that leads to an unambiguous definition of a region is hard. There are many variations across individuals, and being able to define how to always find each border requires describing multiple cases.
- manual segmentation is tricky. Even with a detailed protocol, some borders can be inconsistent and inter-rater agreement is still necessary.
- there are many assumptions to make when deriving morphometric values, and the issue is that those assumptions are, for the most part, not documented in the software we use. Take for instance thickness. Should you measure the shortest distance from every voxel and average those, or take the orthogonal projection from one side of the gyrus to the other (and vice versa) and average those? We chose, like FreeSurfer, the second option. Then comes the problem of folds: when you have a curve and work with voxels, the orthogonal projection can shoot right between voxels on the other side ... that's ok, let's remove that projection. How many projections should we compute (step size), and which ones should we remove? This is just a simple example, but I think not knowing all those decisions in the software can create problems, especially if one wants to compare results from 2 different studies that used different software (even assuming regions match roughly).

Part 3: What to do with variability?

The final part of this work was to compare outputs from the different software packages, using our parcels as a reference frame. Since we know the packages have different borders and so on, we expect differences. This is like comparing the GDP of France in Euros with the GDP of the UK in Sterling - well, yes, it's different. But if we compute each package's metrics relative to our manually derived values, they are in the same units, just like comparing the GDP of France and the UK in US dollars.

In Mikhael and Pernet (2019) we did just that and found: i) as expected from the protocol analysis, anatomical variability is not well accounted for - even double cingulate gyri are missed! ii) volume and surface area are the most affected by this problem, since parcel sizes change (while thickness, even if measured from the wrong region, would need dramatic changes to create big differences on average); iii) multi-atlas registration is the most accurate.

Violin plots show ROI cortical volume in cm3 computed by Masks2Metrics (M2M), FreeSurfer (FS-DK, FS-DKT), BrainSuite (BS), and BrainGyrusMapping (BGM) (middle lines represent the medians, boxes the 95% Bayesian confidence intervals, and the density curves the random average shifted histograms). Line plots show the relative difference from each package (FS, BS, BGM) to the ground-truth estimates (M2M) for each subject (each line is a subject). Double CG occurrences were observed for subjects 1, 5, 6, and 8 in the left hemisphere, and subjects 6 and 10 in the right hemisphere. BrainSuite failed for subjects 4 and 6.


For the time being, if you have a good multi-atlas tool for parcellation, it is likely the most efficient option. The issue is that those tools, for now, will only give you regions from which to compute volumes. To address goals 1 and 3 listed in the rationale, this is my go-to option. If you want to derive morphometrics, then manual editing is required - yes, even for those 600 cases you have to process.


Monday, 16 July 2018

Encouraging diversity in science

Should we use quotas?

After reading this blog post on women's participation in computer science, I was left puzzled by the argument that equal opportunity is all we need. Fortunately, Twitter was there to the rescue, with colleagues pointing me in the right direction. I must mention the post by Lorena A. Barba, which is way more detailed than what I'm summarizing here. To get to the point, I simply try to put down the two approaches, with arguments and counter-arguments. I focus on differences between men and women, but I guess this applies equally to any underrepresented social group.

Equality vs Equity

The idea of 'equality' is that we can increase diversity using incentives and by removal of artificial barriers. This is the reasoning behind promoting equal opportunity for all.

The idea of 'equity' is that we can increase diversity by changing structures that lead to differences in outcomes. This is the reasoning behind promoting equal outcomes for all.

Equality arguments

There are many factors that influence outcomes, and focusing on one marginal effect cannot account for the whole observed effect. Concretely, it means that, for example, many factors explain why men and women have different salaries, and gender per se is just one of them. Yes, this is likely true, but if we were to quantify this (which I haven't researched), I'm pretty sure gender itself would account for a huge amount of the variance, so the multifactorial argument is not good enough.

Another argument is that men and women differ for biological and cultural reasons, and that many workplaces are biased toward men's values and interests (for historical and cultural reasons). An observed difference in outcome in our modern societies, say, e.g., the percentage of women in a given position, is therefore a reflection of women's free choice, since we offer equal opportunities in hiring.

Equity arguments

There is no such thing as free choice. Choice in adulthood is mostly driven by social expectations and cultural pressures, translated into parental guidance and education. This means that the choice made by women not to enter STEM is not a reflection of their 'natural tendencies' (even if these truly existed) but a reflection of an enforced stereotype that they are not good at it. Such stereotypes appear from age 3 to 5, and the bias against STEM has been observed from age 6. Of course, stereotypes are useful for developing and interacting socially, but negative traits in stereotypes are not useful - in fact, they have been shown to impede cognitive performance.

Equal opportunity assumes that, faced with two candidates of equal merit, each candidate has the same chance. While in theory this is nice, it discards all the evidence on implicit bias, which tells us that chances are not equal. It also assumes that we have objective ways to measure merit, which is simply not the case. This implies that to fix social imbalance we need quotas, because equal opportunity does not exist (by the way, quotas have also been linked to increased performance, and do not 'lower the bar' since you remove the worst-performing men for the best-performing women).


In theory, equal opportunity is what we want. In practice, because of stereotypes in our society that drive choices (to enter a career, or to select a candidate), and because of too-often poor measures of merit, equity (quotas) is what we need - until negative traits in stereotypes disappear.

Thursday, 21 June 2018

#OHBM2018 from Home

Learning from a conference at distance with Twitter

I'm a regular at OHBM, but I was in the USA and Canada in May, and I'm going to France in July, so this month of June I stayed home, missing my all-time favourite conference.


At least I have Twitter (@CyrilRPernet), and many of my friends and colleagues use it too, so I can benefit from them attending and telling the world what they found interesting.

Cool talks/posters and papers I had (more or less) time to check because I wasn't there

So what was hot on my Twitter feed?

rs-fMRI and connectomics

Thomas Yeo started posting, before the educational, about an improved parcellation scheme for rs-fMRI: Spatial Topography of Individual-Specific Cortical Networks Predicts Human Cognition, Personality, and Emotion, which comes with code on GitHub.

Michael Breakspear was self-promoting a paper from the dynamic functional connectivity workshop: using the structural connectome to simulate dynamic electrophysiology - super cool gif; see the preprint: Metastable brain waves.

Daniel Margulies' keynote on rs-fMRI gradients got a lot of attention too; here are two papers of interest: Situating the default-mode network along a principal gradient of macroscale cortical organization, and Large-Scale Gradients in Human Cortical Organization.

Martijn van den Heuvel's keynote on connectomics also inspired many - I found his latest paper really inspiring: Multiscale examination of cytoarchitectonic similarity and human brain connectivity.

Tools, tools and tools

J. Lopez received the NeuroImage paper of the year award for 'Reconstructing anatomy from electro-physiological data'.

Andrew Doyle (sorry Pamela*) released a full tutorial on deep learning for the educational that is simply amazing.

Got to mention the poster from Matteo Visconti on getting your data into BIDS straight off the scanner.

Elizabeth DuPre got some interest in the hack room with multi-echo denoising and her tedana tool - thanks to Ross Markello for pointing to 'the' right paper: Separating slow BOLD from non-BOLD baseline drifts using multi-echo fMRI.

Out of this world

I've been checking up on the paper behind the Talairach lecture from K Friston: The Markov blankets of life: autonomy, active inference and the free energy principle

Special mention

Alex Bowring, Camille Maumet, Thomas Nichols paper on 'Exploring the Impact of Analysis Software on Task fMRI Results' got plenty of hits

* my mistake - I thought it was Pamela Douglas (speaker one) who made this one ... updated on the 23rd

Sunday, 3 June 2018

Field maps anyone?

Physics and signal processing of field maps for dummies

In this post, I try to make sense of why we acquire field maps, what they are, and how we use them (and yes, I assume you know a little bit about how an MRI scanner works).

EPI images

Standard fMRI relies on Echo Planar Imaging (EPI), you know, this single pulse that gives you an entire slice of brain. Now, you also know MRI relies on gradients: while B0 is the magnetic field that orients the protons' spins, the observed resonance frequency also depends on additional small gradients. So typically, we use a pulse to excite protons in a slice and apply gradients that change linearly in space: Gy to create a linear change in phase from anterior to posterior as protons relax, and Gx to create a linear change in frequency from left to right; then the magic of (roughly) an inverse 2D Fourier transform takes place and we have a picture of that slice.
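As a toy illustration of that last step (pure NumPy, nothing like a real reconstruction pipeline - scanners fill k-space line by line with all sorts of trajectories), image formation boils down to an inverse 2D Fourier transform of the acquired k-space matrix:

```python
import numpy as np

# A crude 'slice' of brain: a rectangular blob of signal.
slice_true = np.zeros((64, 64))
slice_true[20:44, 24:40] = 1.0

# What the coils effectively measure is (roughly) the 2D Fourier
# transform of the slice: one k-space line per phase-encoding step.
kspace = np.fft.fft2(slice_true)

# Reconstruction: inverse 2D Fourier transform back to image space.
slice_rec = np.fft.ifft2(kspace).real

print(np.allclose(slice_rec, slice_true))  # → True
```

In the scanner, any extra, unwanted field offset adds phase to those k-space samples, which is exactly why inhomogeneities end up as shifted (distorted) pixels after the transform.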

We also know that B0 is not homogeneous and that the subject's magnetic susceptibility varies spatially, that is, the signal is not homogeneous across the brain/head. As the phase-encoding gradient is several orders of magnitude weaker than the frequency-encoding gradient, the inhomogeneous magnetic field manifests itself mainly as a geometric distortion of the 2D slice image along the anterior-posterior direction (of course, on your scanner you can ask to swap the frequency/phase encoding to get distortion in the left/right direction instead, but is this wise?).


If all slices were affected by the same distortion, there would be no problem, but i) the amount of deformation differs from slice to slice (not the same magnetic susceptibility); ii) it differs from acquisition to acquisition (how stable is the scanner over time?); iii) it differs because people move.

Motion causes i) susceptibility distortion, that is, field inhomogeneity causes different distortions of the images, and ii) susceptibility dropout, that is, field inhomogeneity causes signal loss due to through-plane dephasing.

The solution

Acquire an image of the deformation (along the phase-encoding direction) to estimate a nonlinear, unidirectional 2D (un)warping function (leaving the mathematical solutions proposed to the experts). The image of the deformation is typically a difference between images acquired with two different echo times or phase-encoding directions.

Depending on your scanner, you can get (i) a phase-difference image along with the magnitude image, (ii) the two phase images and the corresponding two magnitude images, (iii) a field map showing the inhomogeneity, or (iv) acquisitions with multiple phase-encoding directions. So it is pretty important to know what you get from your scanner before starting to process (undistort) the data. You can however already see here what we want: the field map (option iii), which we can derive from the phase and magnitude images (option i), itself computed as a difference (options ii and iv).
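As a minimal sketch of going from option (i) to option (iii) - assuming you already have an unwrapped phase-difference image in radians and know the two echo times (the function name and TE values below are my own illustrative choices, not from any particular package): the phase accumulated between the two echoes is 2π·ΔB0·ΔTE, so dividing by 2π·ΔTE gives the field offset in Hz.

```python
import numpy as np

def fieldmap_hz(phase_diff_rad, te1_s, te2_s):
    """Convert an (unwrapped) phase-difference image to a field map in Hz.

    Phase accumulated between the two echoes is
    dphi = 2*pi * dB0_Hz * (TE2 - TE1), hence dB0 = dphi / (2*pi*dTE).
    """
    return phase_diff_rad / (2 * np.pi * (te2_s - te1_s))

# e.g. with a classic dual-echo acquisition, TE1 = 4.92 ms, TE2 = 7.38 ms
phase_diff = np.array([[np.pi, -np.pi / 2]])   # fake 1x2 'image', radians
print(fieldmap_hz(phase_diff, 4.92e-3, 7.38e-3))  # ≈ [[203.3, -101.6]] Hz
```

The distortion itself, measured in voxels along the phase-encoding axis, is then roughly the field offset in Hz multiplied by the total readout time - which is why the timing parameters discussed below matter so much.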

Demonstrated effects on normalization and statistics

The effect of using a field map is undistortion of the volumes, which results in better voxel/time-series alignment. Better voxel alignment means more accurate data and thus 'better' statistics. Undistorted volumes also mean better normalization to T1 space and, across subjects, better normalization means better statistics.

Better realignment

Andersson et al. (2001) showed that unwarping leads to smaller residual variance in your time series, which consequently should improve the localization of effects (see e.g. Chambers et al. 2015).

Better normalization

Calhoun et al. (2017) showed that affine transformation of the EPI data to an EPI template, followed by nonlinear registration to that template, reduces variability across subjects compared to the usual route: affine transformation of the EPI data to the subject's anatomical image, followed by nonlinear registration of the anatomical data to an anatomical template, with the resulting transformation then applied to the EPI data. This was, however, no longer the case once distortion correction was used (see figure 10 in the paper).

Better detection of resting state networks

Togo et al.'s (2017) results suggest that resting-state networks (in particular the DMN) show more robust functional connectivity with correction, and that the basal ganglia RSN showed some decreases in functional connectivity primarily in white matter, indicating imperfect registration/normalization without correction.

The work 

Depending on the sequence used you need to know:

- effective echo spacing = 1/(BandwidthPerPixelPhaseEncode * image dimension in phase direction)

- total readout time 
--> for FSL, this is the time from the centre of the first echo to the centre of the last echo = (number of echoes - 1) * echo spacing for plain standard EPI, so for a dual echo that will be the echo spacing. Most of the time, we use parallel imaging and other fancy acquisition tricks, so it's better to use = (image dimension in phase direction - 1) * EffectiveEchoSpacing;
--> for SPM, this is from the beginning of the first echo to the end of the last echo = 1/(BandwidthPerPixelPhaseEncode)

Read from the json file: like all good neuroimagers, we use BIDS, and thus images come with json files. If you use dcm2niix, the json for the EPI already contains all that you want: for SPM, 1/BandwidthPerPixelPhaseEncode, and for FSL, TotalReadoutTime.
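The formulas above can be sketched as a small helper reading a BIDS sidecar - a sketch only, assuming the sidecar contains BandwidthPerPixelPhaseEncode (dcm2niix usually also writes EffectiveEchoSpacing and TotalReadoutTime directly, in which case just read those fields instead):

```python
import json

def readout_parameters(json_path, n_pe):
    """Derive distortion-correction timing parameters from a BIDS sidecar.

    n_pe: image dimension along the phase-encoding direction (e.g. 64).
    Returns (effective echo spacing, FSL total readout, SPM total readout),
    all in seconds.
    """
    with open(json_path) as f:
        meta = json.load(f)

    bw = meta["BandwidthPerPixelPhaseEncode"]          # Hz per pixel
    eff_echo_spacing = 1.0 / (bw * n_pe)               # seconds
    total_readout_fsl = (n_pe - 1) * eff_echo_spacing  # FSL convention
    total_readout_spm = 1.0 / bw                       # SPM convention
    return eff_echo_spacing, total_readout_fsl, total_readout_spm
```

For example, with BandwidthPerPixelPhaseEncode = 20 Hz/pixel and a 64-pixel phase-encoding dimension, this gives an effective echo spacing of 1/(20*64) ≈ 0.78 ms, an FSL total readout time of 63 times that, and an SPM value of 1/20 = 50 ms.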

Now refer the manuals to work out what needs to be done ... 

[updated for json info]
