Sifting through crates.io for malware with OSSF Package Analysis

   

I have a particular fascination with the threat of supply chain compromise via package manager operations. Not so much that a malicious library will be embedded into the final product; rather, that when the programmer installs a package, such as from NPM, PyPI, or crates.io, arbitrary code is executed, which may deposit a backdoor that hands an attacker the developer’s access, secrets, etc.

Again, as developers, we should remember that simply installing a source code package from a repository can invoke arbitrary code on our systems.

One way to monitor for these sorts of attacks is to do large-scale installations of all available packages and see what behavior we encounter. This is one thing that the Open Source Security Foundation (OpenSSF) does.

The OSSF Package Analysis project:

seeks to understand the behavior and capabilities of packages available on open source repositories: what files do they access, what addresses do they connect to, and what commands do they run? The project also tracks changes in how packages behave over time, to identify when previously safe software begins acting suspiciously.

The pipeline works as follows:

  1. Package repositories are monitored for new packages.
  2. Each new package is scheduled to be analyzed by a pool of workers.
  3. A worker performs dynamic analysis of the package inside a sandbox.
  4. Results are stored and imported into BigQuery for inspection.

Sandboxing via gVisor containers ensures the packages are isolated. Detonating a package inside the sandbox allows us to capture strace and packet data that can indicate malicious interactions with the system as well as network connections that can be used to leak sensitive data or allow remote access.

[Figure: architecture diagram]

Notably, the project publishes its results as a public BigQuery data set: the ossf-malware-analysis.packages.analysis table used in the queries below.

The data set currently includes packages from several registries, including NPM, PyPI, Rubygems, and crates.io.

I started with crates.io packages, since I follow the Rust ecosystem and am familiar with the crates.io infrastructure. In my experience, NPM has many more packages and also quite a bit of malware, which can be overwhelming. I think crates.io has enough data to get started while not being exhausting.

Sifting through crates.io packages

I inspected the dynamic analysis results captured by OSSF Package Analysis as they installed and built Rust packages from crates.io. I expect all of the interesting activity will come from build.rs scripts that can do arbitrary things at compile time.

My general strategy is to aggregate activity and estimate the prevalence of various techniques, and then look in the “long tail” of uncommon behavior. For example, by reviewing the DNS names resolved during package installation, and filtering out the domains related to crates.io, I can discover when curl (or similar) is used to fetch untrusted remote content.
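
The same aggregate-then-inspect-the-tail idea can also be applied offline to rows exported from a query. The following is only a sketch under my own assumptions: the function name long_tail, the max_count threshold, and the (package, hostname) row shape are illustrative and not part of the OSSF tooling. The sample rows reuse names that appear in the results later in this post, but the counts are made up.

```python
from collections import Counter

# Hostnames expected during an ordinary crates.io install (the common entries
# in the query results); anything else is a candidate for manual review.
EXPECTED_HOSTS = {"github.com", "crates.io", "static.crates.io", "api.github.com"}


def long_tail(rows, expected=EXPECTED_HOSTS, max_count=5):
    """Return (hostname, count, packages) for rare, unexpected hostnames.

    rows is an iterable of (package_name, hostname) pairs, for example
    exported from a query result. max_count is an arbitrary cut-off.
    """
    rows = list(rows)  # allow generators, since we iterate twice
    counts = Counter(host for _, host in rows)
    packages = {}
    for name, host in rows:
        packages.setdefault(host, set()).add(name)
    tail = [
        (host, count, sorted(packages[host]))
        for host, count in counts.items()
        if host not in expected and count <= max_count
    ]
    # Rarest first, then alphabetical for stable output.
    return sorted(tail, key=lambda item: (item[1], item[0]))


if __name__ == "__main__":
    # Illustrative rows only; the counts do not reflect the real data set.
    sample = [
        ("bnf_sampler", "github.com"),
        ("bnf_sampler", "crates.io"),
        ("cargo-scaffold", "static.crates.io"),
        ("xrvrv", "api.telegram.org"),
        ("kuzu", "apache.jfrog.io"),
        ("kuzu", "apache.jfrog.io"),
    ]
    for host, count, names in long_tail(sample):
        print(host, count, names)
```

Sorting rarest-first puts one-off hostnames at the top, which is where I would expect to start reading.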

DNS resolutions during crates.io package installation

This query lists the crates.io package name, version, and DNS names resolved during installation. Just to get a sense for the data, I selected a recent day and limited the output to 100 rows.

SELECT
  T.Package.Name,
  T.Package.Version,
  Queries.Hostname
FROM
  `ossf-malware-analysis.packages.analysis` AS T,
  T.Analysis.install.DNS AS DNS,
  DNS.Queries AS Queries
WHERE
  TIMESTAMP_TRUNC(CreatedTimestamp, DAY) = TIMESTAMP("2023-09-12")
  AND Package.Ecosystem = "crates.io"
LIMIT 100
;

Name Version Hostname
actix-client-ip-cloudflare 0.1.0 github.com
actix-client-ip-cloudflare 0.1.0 crates.io
actix-client-ip-cloudflare 0.1.0 static.crates.io
actix-client-ip-cloudflare 0.1.0 api.github.com
bnf_sampler 0.2.0 api.github.com
bnf_sampler 0.2.0 github.com
bnf_sampler 0.2.0 crates.io
bnf_sampler 0.2.0 static.crates.io
cargo-scaffold 0.8.12 crates.io
cargo-scaffold 0.8.12 static.crates.io

Prevalence of DNS resolutions during crates.io package installation

SELECT 
  Queries.Hostname, 
  COUNT(*) AS `count`
FROM
  `ossf-malware-analysis.packages.analysis` AS T,
  T.Analysis.install.DNS as DNS,
  DNS.Queries AS Queries
WHERE
  TIMESTAMP_TRUNC(CreatedTimestamp, YEAR) = TIMESTAMP("2023-01-01")
  AND Package.Ecosystem = "crates.io"
GROUP BY Queries.Hostname
ORDER BY `count` DESC
;

Hostname count
github.com 142785
crates.io 141911
static.crates.io 141894
api.github.com 70351
objects.githubusercontent.com 214
proxy.golang.org 18
storage.googleapis.com 16
codeload.github.com 15
software.ditto.live 11
zlib.net 10
static.crates.iocrates 9
download.mosek.com 6
files.pythonhosted.org 5
pypi.org 5
www.fftw.org 4
raw.githubusercontent.com 4
cdn.intrepidcs.net 3
pkg-containers.githubusercontent.com 2
ghcr.io 2
apache.jfrog.io 2
jfrog-prod-usw2-shared-oregon-main.s3.amazonaws.com 2
www.apache.org 1
ip-api.com 1
api.telegram.org 1
archive.apache.org 1
sourceware.org 1
git.openprivacy.ca 1
_pgpkey-http._tcp.keyserver.ubuntu.com 1
drive.google.com 1
www.byond.com 1
doc-08-24-docs.googleusercontent.com 1
keyserver.ubuntu.com 1
dlib.net 1
binaries.soliditylang.org 1

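At least two entries stand out as suspicious: static.crates.iocrates and api.telegram.org.

static.crates.iocrates seems a little weird, but is quickly explained by a bug in crates.io that is described in this incident report: crates.io Postmortem: Broken Crate Downloads. The api.telegram.org lookup is traced to a specific package below.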

Uncommon DNS names resolved during crates.io package installation

We can generate the direct download URL for the crate given the name and version. The content is a gzip-compressed tar archive containing the Rust source code and Cargo.toml metadata file.

SELECT
  Queries.Hostname,
  T.Package.Name,
  T.Package.Version,
  FORMAT(
    "https://crates.io/api/v1/crates/%s/%s/download",
    T.Package.Name,
    T.Package.Version) AS url
FROM
  `ossf-malware-analysis.packages.analysis` AS T,
  T.Analysis.install.DNS as DNS,
  DNS.Queries AS Queries
WHERE
  Package.Ecosystem = "crates.io"
  AND Queries.Hostname NOT IN (
    "github.com", "crates.io", "static.crates.io",
    "api.github.com", "static.crates.iocrates")
ORDER BY
  Queries.Hostname,
  T.Package.Name,
  T.Package.Version DESC
;

Hostname Name Version url
_pgpkey-http._tcp.keyserver.ubuntu.com nginx-sys 0.2.0 download
apache.jfrog.io kuzu 0.0.5-pre.1 download
apache.jfrog.io kuzu 0.0.5 download
api.telegram.org xrvrv 0.1.1 download
archive.apache.org kuzu 0.0.5-pre.1 download
binaries.soliditylang.org svm-rs-builds 0.2.0 download
cdn.intrepidcs.net libicsneo-sys 0.2.0 download
cdn.intrepidcs.net libicsneo-sys 0.1.19 download
cdn.intrepidcs.net libicsneo-sys 0.1.18 download
cfhcable.dl.sourceforge.net libtirpc-sys 0.2.0 download
chromium.googlesource.com wren 0.1.12 download
chromium.googlesource.com wren-sys 0.2.5 download
codeload.github.com cblas-src 0.1.3 download
codeload.github.com cudd 0.1.4 download
codeload.github.com cudd 0.1.3 download
codeload.github.com cudd 0.1.2 download
codeload.github.com cudd 0.1.1 download
codeload.github.com cudd-sys 1.0.0 download
codeload.github.com d4 0.3.7 download
codeload.github.com d4-bigwig 0.3.6 download
codeload.github.com d4-hts 0.3.9 download
codeload.github.com d4-hts 0.3.7 download
codeload.github.com ipopt 0.5.4 download
codeload.github.com ipopt-sys 0.5.5 download

So we can see that api.telegram.org was resolved by xrvrv@0.1.1, which was nicely described by Phylum here: Rust Malware Staged on Crates.io.

I triaged the remaining uncommon DNS resolutions and didn’t find anything malicious.

Commands executed during crates.io package installation

SELECT
  T.Package.Name,
  T.Package.Version,
  ARRAY_TO_STRING(Commands.Command, " ") AS command
FROM
  `ossf-malware-analysis.packages.analysis` AS T,
  T.Analysis.install.Commands as Commands
WHERE
  Package.Ecosystem = "crates.io"
  AND TIMESTAMP_TRUNC(CreatedTimestamp, MONTH) = TIMESTAMP("2023-09-01")
ORDER BY
  T.Package.Name,
  T.Package.Version ASC
LIMIT 10
;

Name Version command
a1_notation 0.4.0 rustc --crate-name a1_notation --edition=2021 /usr/local/cargo/registry/src/github.com-1ecc6299db9ec823/a1_notation-0.4.0/src/lib.rs --error-format=json --json=diagnostic-rendered-ansi,artifacts,future-incompat --crate-type lib --emit=dep-info,metadata,link -C embed-bitcode=no -C debuginfo=2 -C metadata=c1fc3b749240cbbe -C extra-filename=-c1fc3b749240cbbe --out-dir /app/target/debug/deps -L dependency=/app/target/debug/deps --extern serde=/app/target/debug/deps/libserde-8dcf5821f9268294.rmeta --cap-lints allow
a1_notation 0.4.0 rustc --crate-name serde --edition=2018 /usr/local/cargo/registry/src/github.com-1ecc6299db9ec823/serde-1.0.188/src/lib.rs --error-format=json --json=diagnostic-rendered-ansi,artifacts,future-incompat --crate-type lib --emit=dep-info,metadata,link -C embed-bitcode=no -C debuginfo=2 --cfg feature="default" --cfg feature="derive" --cfg feature="serde_derive" --cfg feature="std" -C metadata=8dcf5821f9268294 -C extra-filename=-8dcf5821f9268294 --out-dir /app/target/debug/deps -L dependency=/app/target/debug/deps --extern serde_derive=/app/target/debug/deps/libserde_derive-065cdb214387bc01.so --cap-lints allow

…which is kind of noisy and difficult to scan. So, let’s focus on just the executable, not the complete command line.

Prevalence of programs executed during crates.io package installation

SELECT
  Commands.Command[offset(0)] AS exe,
  COUNT(*) AS `count`,
FROM
  `ossf-malware-analysis.packages.analysis` AS T,
  T.Analysis.install.Commands as Commands
WHERE
  Package.Ecosystem = "crates.io"
  AND TIMESTAMP_TRUNC(CreatedTimestamp, MONTH) = TIMESTAMP("2023-09-01")
  AND Commands.Command[offset(0)] NOT LIKE "/app/target/%"
GROUP BY exe
ORDER BY `count` DESC
;

exe count
rustc 144349
cc 34076
/usr/lib/gcc/x86_64-linux-gnu/11/collect2 28135
/usr/bin/ld 27778
python3 7666
sleep 7656
cargo 7647
as 6464
rm 6329
/usr/lib/gcc/x86_64-linux-gnu/11/cc1 6171
/bin/bash 2583
sed 2501
/bin/sh 2191
/usr/bin/cmake 1672
freebsd-version 1350
clang 1320
mv 1126
grep 1016
dirname 987
/usr/bin/gmake 966
ln 954
/usr/bin/cc 945
/usr/lib/llvm-14/bin/clang 816
ar 763
/usr/lib/gcc/x86_64-linux-gnu/11/cc1plus 618
c++ 489
cat 410
/usr/bin/sed 402
/usr/bin/mkdir 392
pkg-config 382
expr 267
make 226
/usr/bin/clang 182
mkdir 166
/usr/bin/c++ 158
rustfmt 146
/usr/bin/grep 139
git 135
gcc 88
basename 82
uname 81
clang++ 79
cmake 75
tr 70
/usr/bin/uname 69
m4 69
llvm-config 68
../mpn/m4-ccas 65
rmdir 57
install 51
/usr/bin/nm 45
sort 44
chmod 44
/usr/bin/perl 43
awk 43
g++ 39
date 37
/usr/bin/pkg-config 32
mawk 31
ls 31
mktemp 31
cmp 30
/usr/bin/rustfmt 30
/usr/bin/install 30
sh 29
ranlib 29
touch 27
./conftest 24
cp 22
diff 22
/usr/bin/clang++ 22
echo 22
hostname 20
/usr/lib/git-core/git-sh-i18n--envsubst 18
CMAKE_Fortran_COMPILER-NOTFOUND 18
command 14
print 14
cut 13
/usr/bin/hostinfo 11
/usr/convex/getsysinfo 11
/bin/arch 11
/bin/uname 11
/usr/bin/arch 11
/bin/universe 11
/bin/machine 11
/usr/bin/oslevel 11
/usr/lib/git-core/git-submodule 9
uniq 9
true 9
strip 9
getconf 9
objdump 9
./gen-bases 8
./gen-fib 8
/usr/bin/ar 8
nm 8
/usr/bin/ranlib 8
/usr/bin/dd 8
tar 7
id 7
rustdoc 6
/usr/local/cargo/registry/src/github.com-1ecc6299db9ec823/prost-build-0.8.0/third-party/protobuf/protoc-linux-x86_64 6
bash 6
wc 6
python 6
go 5
./gen-fac 4
file 4
./gen-jacobitab 4
./configure 4
./a.out 4
./gen-trialdivtab 4
od 4
env 4
./gen-sieve 4
./gen-psqr 4
/sbin/ldconfig.real 4
ldconfig 4
../../gmp-src/mpn/m4-ccas 4
which 3
timeout 3
pg_config 3
./runtests.sh 3
/usr/bin/protoc 3
/usr/lib/go-1.18/pkg/tool/linux_amd64/link 3
./runtests-quiet.sh 3
./232c93d07b74.t 3
link 2
../gmp-src/configure 2
fgrep 2
tail 2
/usr/bin/go 2
/usr/lib/go-1.18/pkg/tool/linux_amd64/compile 2
ld 2
readelf 2
cmake3 2
perl 2
capnp 2
curl-config 1
/usr/lib/go-1.18/pkg/tool/linux_amd64/asm 1
/bin/nasm 1
/usr/local/sbin/nasm 1
/tmp/cguFXqiB/dummy 1
lean 1
/sbin/nasm 1
/usr/bin/nasm 1
/tmp/cgRZejKa/dummy 1
/tmp/go-build3600745051/b001/exe/godeps 1
/usr/sbin/nasm 1
/tmp/cgDqPer7/dummy 1
/usr/local/cargo/bin/nasm 1
gzip 1
/tmp/cgwqnVEe/dummy 1
/tmp/go-build1160633071/b001/exe/godeps 1
/usr/local/bin/nasm 1
/usr/bin/file 1
nasm 1
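
One thing that stands out in this listing is that the same tool shows up under several spellings, such as sed and /usr/bin/sed, or cc and /usr/bin/cc. A possible follow-up, sketched here under my own assumptions, is to count by basename instead. The function name and the input shape (a list of argv lists, like the Commands.Command column) are illustrative.

```python
import posixpath
from collections import Counter


def count_executables(commands):
    """Count programs by basename so that sed and /usr/bin/sed are merged.

    commands is an iterable of argv lists. Build outputs under /app/target/
    are skipped, matching the NOT LIKE filter in the query.
    """
    counts = Counter()
    for argv in commands:
        if not argv or argv[0].startswith("/app/target/"):
            continue
        counts[posixpath.basename(argv[0])] += 1
    return counts


if __name__ == "__main__":
    # Illustrative rows only.
    sample = [
        ["sed", "-e", "s/a/b/"],
        ["/usr/bin/sed", "-n", "1p"],
        ["rustc", "--version"],
        ["/app/target/debug/build/demo/build-script-build"],
    ]
    for exe, count in count_executables(sample).most_common():
        print(exe, count)
```

The trade-off is that the directory is itself a signal (a binary run from /tmp is more interesting than one from /usr/bin), so I would treat the merged counts as a complement to the full-path listing rather than a replacement.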

I wonder why some packages use curl during installation? Let’s see which packages those are, this time looking across all of 2023.

crates.io packages invoking a specific program during installation

SELECT
  T.Package.Name,
  T.Package.Version,
  ARRAY_TO_STRING(Commands.Command, " ") AS command,
  Commands.Command[OFFSET(0)] AS exe,
  FORMAT(
    "https://crates.io/api/v1/crates/%s/%s/download", 
    T.Package.Name, 
    T.Package.Version) AS url
FROM
  `ossf-malware-analysis.packages.analysis` AS T,
  T.Analysis.install.Commands AS Commands
WHERE
  Package.Ecosystem = "crates.io"
  AND TIMESTAMP_TRUNC(CreatedTimestamp, YEAR) = TIMESTAMP("2023-01-01")
  AND Commands.Command[OFFSET(0)] = "curl"
ORDER BY
  T.Package.Name,
  T.Package.Version ASC
;

Name Version exe command
caffe2op-bisect 0.1.4-alpha.0 curl curl http://www.fftw.org/fftw-3.3.6-pl1.tar.gz
caffe2op-ceil 0.1.4-alpha.0 curl curl http://www.fftw.org/fftw-3.3.6-pl1.tar.gz
caffe2op-channelbackprop 0.1.4-alpha.0 curl curl http://www.fftw.org/fftw-3.3.6-pl1.tar.gz
caffe2op-collect 0.1.4-alpha.0 curl curl http://www.fftw.org/fftw-3.3.6-pl1.tar.gz
cudd 0.1.1 curl curl -L https://github.com/ivmai/cudd/archive/refs/tags/cudd-3.0.0.tar.gz -o /app/target/debug/build/cudd-sys-ad36e18db6f8070c/out/cudd-3.0.0.tar.gz
d4-hts 0.3.9 curl curl -L http://github.com/madler/zlib/archive/refs/tags/v1.2.11.tar.gz
d4-hts 0.3.9 curl curl http://sourceware.org/pub/bzip2/bzip2-1.0.8.tar.gz
deno_url 0.107.0 curl curl -L -f -s -o /app/target/debug/gn_out/obj/librusty_v8.tmp https://github.com/denoland/rusty_v8/releases/download/v0.73.0/librusty_v8_release_x86_64-unknown-linux-gnu.a
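
Several of these downloads use http:// rather than https:// URLs. As a rough sketch of how such rows could be flagged automatically, the snippet below pulls URLs out of a curl command line and reports the ones using plain HTTP. The function names are hypothetical, and it assumes the command column was joined with single spaces (as ARRAY_TO_STRING does above), so arguments that themselves contain spaces would be split incorrectly.

```python
import shlex
from urllib.parse import urlsplit


def curl_urls(command):
    """Return the http(s) URLs that appear as arguments of a command line."""
    urls = []
    for arg in shlex.split(command):
        parts = urlsplit(arg)
        if parts.scheme in ("http", "https") and parts.netloc:
            urls.append(arg)
    return urls


def plain_http_downloads(rows):
    """Yield (package, url) for URLs fetched over unencrypted HTTP.

    rows is an iterable of (package_name, command_string) pairs.
    """
    for package, command in rows:
        for url in curl_urls(command):
            if urlsplit(url).scheme == "http":
                yield package, url


if __name__ == "__main__":
    # Two rows copied from the results above.
    sample = [
        ("cudd", "curl -L https://github.com/ivmai/cudd/archive/refs/tags/cudd-3.0.0.tar.gz -o /app/target/debug/build/cudd-sys-ad36e18db6f8070c/out/cudd-3.0.0.tar.gz"),
        ("d4-hts", "curl http://sourceware.org/pub/bzip2/bzip2-1.0.8.tar.gz"),
    ]
    for package, url in plain_http_downloads(sample):
        print(package, url)
```

A plain-HTTP fetch is not malicious by itself, but it is worth a second look because the downloaded content could be modified in transit.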

Continuing work

Of course, you can apply all of the above queries to the other package registries, such as NPM and Rubygems. Expect to see more data in every dimension, including ongoing attacks and malware.

I’ll keep this post updated as I craft further useful queries for exploring the OSSF Package Analysis data set. I’m keen to periodically run queries that highlight “new” activity, such as DNS names that haven’t been seen before. Perhaps I can find a way to publish that feed via RSS and encourage everyone to monitor new packages.
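
As a starting point for that kind of feed, here is a small sketch of the "not seen before" check, run over rows exported from the DNS queries. Everything in it is an assumption for illustration: the function names, the (day, package, hostname) row shape, and the sample rows, which are invented.

```python
from datetime import date


def first_seen(observations):
    """Map each hostname to the (day, package) of its earliest resolution.

    observations is an iterable of (day, package, hostname) tuples.
    """
    earliest = {}
    for day, package, hostname in observations:
        if hostname not in earliest or day < earliest[hostname][0]:
            earliest[hostname] = (day, package)
    return earliest


def new_hostnames(observations, day):
    """Return sorted (hostname, package) pairs first sighted on the given day."""
    return sorted(
        (hostname, package)
        for hostname, (seen, package) in first_seen(observations).items()
        if seen == day
    )


if __name__ == "__main__":
    # Invented rows; real input would come from the data set.
    sample = [
        (date(2023, 9, 10), "example-a", "github.com"),
        (date(2023, 9, 12), "example-b", "github.com"),
        (date(2023, 9, 12), "example-c", "downloads.example.org"),
    ]
    for hostname, package in new_hostnames(sample, date(2023, 9, 12)):
        print(hostname, package)
```

Each day's output could then be rendered as feed entries; how best to publish them is still an open question for me.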